interesting - i will try this and let you know because it was set to windows encoding (why on earth!?)
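
For anyone following along, here is a minimal Java sketch (plain JDK, not Nutch code) of why a windows-1252 default garbles UTF-8 pages. The class and method names are made up for illustration; the only assumption is that the parser decodes the raw page bytes with whatever default charset it was given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; not part of Nutch. */
final class EncodingMismatchDemo {

    /** Decodes raw page bytes with the charset a parser assumed for them. */
    static String decode(byte[] pageBytes, Charset assumed) {
        return new String(pageBytes, assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // "cafe" with an accented e, as it would arrive from a UTF-8 page.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Wrong default: each byte of the two-byte UTF-8 sequence
        // is turned into its own (unrelated) character.
        System.out.println(decode(utf8Bytes, cp1252));

        // Matching default: the original text comes back intact.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Going the other way, a character with no windows-1252 mapping
        // is replaced by '?' when the string is encoded.
        byte[] lossy = "\u4e2d".getBytes(cp1252);
        System.out.println((char) lossy[0]);
    }
}
```

If the default is switched to UTF-8, the first case should go away for pages that really are UTF-8; pages in other encodings still depend on the detection working.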
On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:
> > > i dont have any special requirement for any special characters, i am
> > > happy with usual utf-8
> > >
> > > any suggestion on the best way to configure this correctly; everything
> > > seems quite ok looking at the code not sure whats missing.
>
> Try to set UTF-8 in configuration file:
> parser.character.encoding.default = UTF-8
>
> > -----Original Message-----
> > From: Fuad Efendi [mailto:f...@efendi.ca]
> > Sent: October-29-09 8:19 PM
> > To: nutch-user@lucene.apache.org; fa...@butterflycluster.net
> > Subject: RE: char encoding
> >
> > Is it "?" or "¿" (Inverted Question Mark)?
> >
> > Because ¿ is replacement for character codes not having representation in
> > specific encoding scheme; you may get it, for instance, if binary stream is
> > UTF-8 encoded, and Nutch considers it as Windows-1252. All byte(s) not
> > having representation in windows-1252 will be represented as "¿".
> >
> > Nutch tries on the best effort; however, it can't use dedicated CPU as
> > browsers.... I agree with Ken. Browsers may fully ignore headers/meta and
> > sniff and analyze byte array to find correct encoding (in case, for
> > instance, if byte stream is UTF-8, and http/meta is windows-1252). Nutch
> > can't do that (it requires a lot of CPU).
> >
> > Windows-1252 is default scheme for html-parser in case if Nutch can't find
> > correct HTTP/META...
> >
> > From HtmlParser API:
> >  * We need to do something similar to what's done by mozilla
> >  * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
> >  * See also http://www.w3.org/TR/REC-xml/#sec-guessing
> >
> > private static String sniffCharacterEncoding(byte[] content) {...}
> >
> > - it doesn't currently use HTTP Headers.
> > - it tries to find META tag in first 2000 bytes.
> >
> > So, for instance, some weird sites (such as AJAX/Portals) may have a lot of
> > generated JavaScript before META tag; 2000 could be small.
> >
> > Then, EncodingDetector is called:
> >     detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> >
> > - but it doesn't make sense...
> >
> > public String guessEncoding(Content content, String defaultValue) {
> >   /*
> >    * This algorithm could be replaced by something more sophisticated;
> >    * ideally we would gather a bunch of data on where various clues
> >    * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
> >    * the correct answer, and use machine learning/some statistical method
> >    * to generate a better heuristic.
> >    */
> >
> > TODO list... as a workaround, please check for this site that META could be
> > found in first 2000 bytes...
> >
> > -Fuad
> > http://www.linkedin.com/in/liferay
> >
> > > -----Original Message-----
> > > From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> > > Sent: October-29-09 7:05 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: char encoding
> > >
> > > hi there,
> > >
> > > i am having issues with the HTMLParser failing to detect the char
> > > encoding. so lots of non alpha-numeric chars end up as "?" ;
> > >
> > > i dont have any special requirement for any special characters, i am
> > > happy with usual utf-8
> > >
> > > any suggestion on the best way to configure this correctly; everything
> > > seems quite ok looking at the code not sure whats missing.
> > >
> > > thanks.
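
To make the "META in the first 2000 bytes" limitation from the thread concrete, here is a rough Java sketch of that kind of sniffing. It is not the Nutch implementation: the class name, the regular expression and the window parameter are assumptions for illustration only. It just shows how a declaration that sits behind a large block of generated script falls outside a fixed window.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of META-based charset sniffing; not Nutch's code. */
final class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*charset\\s*=\\s*[\"']?([a-z][a-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Looks for a META charset declaration in the first {@code window} bytes.
     * Returns the declared name, or null if none is found inside the window.
     */
    static String sniff(byte[] content, int window) {
        int length = Math.min(content.length, window);
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable whatever the page's real encoding is.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\">";

        // Simulate a page with 3000 bytes of generated script before the META tag.
        StringBuilder page = new StringBuilder("<html><head><script>");
        for (int i = 0; i < 3000; i++) {
            page.append('x');
        }
        page.append("</script>").append(meta);
        byte[] bytes = page.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(sniff(bytes, 2000)); // declaration is outside the window
        System.out.println(sniff(bytes, 4000)); // declaration is inside the window
    }
}
```

With a window of 2000 the sniffer finds nothing for such a page, and the parser would then fall back to whatever default encoding is configured, which is why the default setting matters for sites like this.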