RE: char encoding

Fuad Efendi Thu, 29 Oct 2009 17:19:20 -0700

Is it "?" or "¿" (Inverted Question Mark)?

Because ¿ is the replacement for character codes that have no representation
in a specific encoding scheme; you may get it, for instance, if the binary
stream is UTF-8 encoded and Nutch treats it as Windows-1252. Any bytes that
have no representation in Windows-1252 will be shown as "¿".
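
To see the two failure modes side by side, here is a minimal sketch in plain Java. It is not Nutch code; the class and method names are made up for illustration, and it only shows what the JDK itself does, which may differ from what a given Nutch setup ends up storing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // windows-1252 is shipped with mainstream JDKs, but it is not one of
    // the charsets the Java spec guarantees, so this lookup is an assumption.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Decode bytes that are really UTF-8 as if they were windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WINDOWS_1252);
    }

    // Encode text into a charset that lacks some of its characters,
    // then decode it again with the same charset.
    static String roundTripAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(misdecode(word));
        System.out.println(roundTripAscii(word));
    }
}
```

With these inputs, misdecode turns the single accented letter into the two characters U+00C3 U+00A9 (the usual "Ã©" garbage), while roundTripAscii returns "caf?", because String.getBytes substitutes the target charset's replacement byte for anything it cannot encode.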


Nutch works on a best-effort basis; however, it can't dedicate CPU to this
the way browsers do. I agree with Ken. Browsers may completely ignore
headers/meta and sniff and analyze the byte array to find the correct
encoding (for instance, when the byte stream is UTF-8 but HTTP/META says
windows-1252). Nutch can't do that (it requires a lot of CPU).
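
As an illustration of the simplest kind of byte sniffing, the sketch below checks whether a byte array is well-formed UTF-8 by decoding it strictly. This is only an assumption about what a cheap check could look like; it is not what Nutch or any particular browser does, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true when the whole array decodes as UTF-8 without errors.
    static boolean looksLikeUtf8(byte[] content) {
        // REPORT makes the decoder throw instead of silently substituting
        // a replacement character for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII content passes this check under either labelling, so it only helps when the page actually contains non-ASCII bytes. It also has to read the whole array, which is part of the CPU cost mentioned above.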

Windows-1252 is the default scheme for html-parser if Nutch can't find a
correct HTTP/META...


From the HtmlParser API:
   * We need to do something similar to what's done by mozilla
   * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
   * See also http://www.w3.org/TR/REC-xml/#sec-guessing


private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use HTTP headers.
- it tries to find a META tag in the first 2000 bytes.


So, for instance, some weird sites (such as AJAX/Portals) may have a lot of
generated JavaScript before the META tag; 2000 bytes could be too small.
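
Here is a rough sketch of that limitation. It is not the actual sniffCharacterEncoding code: the class name, the simplified regular expression, and the CHUNK_SIZE constant (set to the 2000-byte figure above) are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Assumed window size, taken from the "first 2000 bytes" figure.
    static final int CHUNK_SIZE = 2000;

    // Hypothetical, simplified pattern: a charset=... value inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a META tag within the first CHUNK_SIZE
    // bytes, or null when no such tag is found in that window.
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so the markup can
        // be scanned before the real encoding is known.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A page whose META tag sits near the top yields its declared charset, while the same tag placed after roughly 3000 bytes of inline script falls outside the window and sniff returns null, so the caller is left with whatever default it uses.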

Then, EncodingDetector is called:
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make sense...


  public String guessEncoding(Content content, String defaultValue) {
    /*
     * This algorithm could be replaced by something more sophisticated;
     * ideally we would gather a bunch of data on where various clues
     * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
     * the correct answer, and use machine learning/some statistical method
     * to generate a better heuristic.
     */



TODO list... As a workaround, please check whether, for this site, the META
tag can be found in the first 2000 bytes...



-Fuad
http://www.linkedin.com/liferay


> -----Original Message-----
> From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> Sent: October-29-09 7:05 PM
> To: nutch-user@lucene.apache.org
> Subject: char encoding
> 
> hi there,
> 
> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;
> 
> i dont have any special requirement for any special characters, i am
> happy with usual utf-8
> 
> any suggestion on the best way to configure this correctly; everything
> seems quite ok looking at the code not sure whats missing.
> 
> thanks.
> 
>
