Is it "?" or "¿" (inverted question mark)? "¿" is the replacement for character codes that have no representation in a specific encoding scheme; you may get it, for instance, if the byte stream is UTF-8 encoded and Nutch treats it as Windows-1252. Any bytes that have no representation in Windows-1252 will be shown as "¿".
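To see the mismatch in isolation, here is a minimal Java sketch. The class and method names are made up for illustration and are not part of Nutch. It decodes UTF-8 bytes as windows-1252, and separately pushes characters that windows-1252 cannot represent through that charset. With the JDK's String API the substitute in the second case is "?"; other software may pick a different substitute character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset. */
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, CP1252);
    }

    /**
     * Encodes text as windows-1252 and decodes it back. Characters that
     * windows-1252 cannot represent are replaced by the charset's default
     * replacement byte during encoding.
     */
    public static String roundTrip(String text) {
        return new String(text.getBytes(CP1252), CP1252);
    }

    public static void main(String[] args) {
        // "café": the two UTF-8 bytes of "é" are read as two separate characters
        System.out.println(misdecode("caf\u00e9"));
        // Two CJK characters, neither representable in windows-1252
        System.out.println(roundTrip("\u65e5\u672c"));
    }
}
```

If the stored text shows two-character sequences where one accented letter is expected, the bytes were probably decoded with the wrong charset; a bare "?" suggests the character was lost when converting into a charset that cannot hold it.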
Nutch makes a best effort; however, it can't spend as much CPU on this as browsers do... I agree with Ken. Browsers may ignore headers/meta entirely and sniff and analyze the byte array to find the correct encoding (for instance, when the byte stream is UTF-8 but HTTP/META says windows-1252). Nutch can't do that (it requires a lot of CPU). Windows-1252 is the default scheme for the HTML parser when Nutch can't find a correct HTTP/META...

From the HtmlParser API:

    * We need to do something similar to what's done by mozilla
    * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
    * See also http://www.w3.org/TR/REC-xml/#sec-guessing
    private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use HTTP headers.
- it tries to find the META tag in the first 2000 bytes. So, for instance, some weird sites (such as AJAX/portals) may have a lot of generated JavaScript before the META tag; 2000 could be too small.

Then, EncodingDetector is called:

    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make sense...

    public String guessEncoding(Content content, String defaultValue) {
      /*
       * This algorithm could be replaced by something more sophisticated;
       * ideally we would gather a bunch of data on where various clues
       * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
       * the correct answer, and use machine learning/some statistical method
       * to generate a better heuristic.
       */

TODO list...

As a workaround, please check whether, for this site, the META tag can be found in the first 2000 bytes...

-Fuad
http://www.linkedin.com/liferay

> -----Original Message-----
> From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> Sent: October-29-09 7:05 PM
> To: nutch-user@lucene.apache.org
> Subject: char encoding
>
> hi there,
>
> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;
>
> i dont have any special requirement for any special characters, i am
> happy with usual utf-8
>
> any suggestion on the best way to configure this correctly; everything
> seems quite ok looking at the code not sure whats missing.
>
> thanks.
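
P.S. For the workaround (checking whether the META charset declaration falls inside the first 2000 bytes), something like the following could be run against the raw page bytes. This is only a rough sketch: the class name, method names, and regular expression are my own assumptions and are not what HtmlParser actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetProbe {
    // Hypothetical pattern for illustration; Nutch's own pattern may differ.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    /** Byte offset where the first META charset declaration starts, or -1 if none. */
    public static int metaCharsetOffset(byte[] content) {
        // ISO-8859-1 maps every byte to exactly one char, so char offsets equal byte offsets.
        String text = new String(content, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.start() : -1;
    }

    /** True if a complete META charset declaration lies within the first `window` bytes. */
    public static boolean foundWithin(byte[] content, int window) {
        int len = Math.min(content.length, window);
        String prefix = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        return META_CHARSET.matcher(prefix).find();
    }

    public static void main(String[] args) {
        // Simulate a page with a large generated script before the META tag.
        StringBuilder page = new StringBuilder("<html><head><script>");
        for (int i = 0; i < 3000; i++) {
            page.append('x');
        }
        page.append("</script><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=utf-8\"></head></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("META offset: " + metaCharsetOffset(bytes));
        System.out.println("Within 2000 bytes: " + foundWithin(bytes, 2000));
    }
}
```

If the offset comes back larger than 2000, the sniffing step described above would not see the declaration, and the default encoding would likely be used instead.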