[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ]
Doug Cook commented on NUTCH-25:
--------------------------------

Hi, Doğacan. My sincere apologies for the slow response, especially given the alacrity with which you whipped up that patch.

I had to back-port the patch to my 0.81 environment for testing, so I can't 100% guarantee that your patch works as-is on 0.9. At any rate, in my environment it seems to work pretty well, at least in my limited testing, and I didn't see any obvious problems on code review. I was using a 50% confidence threshold, and most of the time the detection code kicked in (with the correct answer). All of the documents I was having problems with were fine.

There seemed to be a typo in the patch; there's a try statement missing here, if I read correctly, but I just put in a try and took out the funky isTraceEnabled(), and all was well:

    -          true);
    -      } catch (SAXException e) {}
    +          LOG.isTraceEnabled());
    +      parser.setProperty("http://cyberneko.org/html/properties/default-encoding",
    +          defaultCharEncoding);
    +      parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset",
    +          true);
    +      } catch (SAXException e) {
    +        LOG.trace(e);
    +      }

My only (minor) suggestion would be to change the LOG.trace statements in HtmlParser to note how they determined the encoding, e.g.:

    if (LOG.isTraceEnabled()) {
      LOG.trace(base + ": setting encoding to (DETECTED) " + encoding);
    }

That way one can look at the logs and see how often each of the 3 methods (detection, response header, sniffing) is used.

Thanks again for the patch; it's good stuff, and useful.
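To make the suggestion concrete, here is a minimal, self-contained sketch of what the tagged log messages could look like. The helper name `describe`, its parameters, and the (HEADER)/(SNIFFED)/(DEFAULT) labels are illustrative assumptions — only the "(DETECTED)" tag appears in the comment above, and the real HtmlParser code would log from its existing LOG object rather than a helper like this:

```java
// Illustrative sketch only: tag each log message with the method that
// determined the encoding, so log greps show how often each one fires.
public class EncodingSource {

    /** Returns a log-friendly message naming where the encoding came from.
     *  Priority here (detector, then header, then sniffed meta tag, then
     *  a configured default) is an assumption for illustration. */
    static String describe(String detected, String headerCharset,
                           String sniffed, String fallback) {
        if (detected != null)      return "setting encoding to (DETECTED) " + detected;
        if (headerCharset != null) return "setting encoding to (HEADER) " + headerCharset;
        if (sniffed != null)       return "setting encoding to (SNIFFED) " + sniffed;
        return "setting encoding to (DEFAULT) " + fallback;
    }

    public static void main(String[] args) {
        System.out.println("http://example.com/: "
            + describe("UTF-8", null, null, "windows-1252"));
        System.out.println("http://example.com/a: "
            + describe(null, "ISO-8859-1", null, "windows-1252"));
    }
}
```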
> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>         Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
>
> This is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
>
> Although we can cover a lot of ground using the 'C-T' (Content-Type) header
> field in the HTTP header and the corresponding meta tag in HTML documents
> (and in the case of XML, we have to use a similar but different 'parsing'),
> in the wild there are a lot of documents without any information about the
> character encoding used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with these 'unlabelled'
> documents.
>
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to
> port it to Java. Unfortunately, it's not fool-proof. However, along with
> some other heuristics used by Mozilla and elsewhere, it should be possible
> to achieve a high detection rate. The following page has links to some
> other related pages:
> http://trainedmonkey.com/week/2004/26
>
> In addition to character encoding detection, we also need to detect the
> language of a document, which is even harder and should be a separate bug
> (although it's related).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
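For readers following the thread: the meta-tag 'sniffing' that the quoted description builds on can be sketched in a few lines. This is not Nutch or Mozilla code — the class name, the regex, and the choice to scan only an ASCII-decoded byte prefix are illustrative assumptions about how such a sniffer might work:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: extract a declared charset from an HTML byte prefix.
public class MetaCharsetSniffer {

    // Matches charset=... in a meta tag, e.g.
    //   <meta http-equiv="Content-Type" content="text/html; charset=EUC-KR">
    private static final Pattern META_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    /** Scans the leading bytes (decoded as ASCII, which is safe for most
     *  ASCII-compatible encodings) for a declared charset; null if absent. */
    static String sniff(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=EUC-KR\"></head>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(html)); // prints EUC-KR
    }
}
```

A statistical detector like Mozilla's is still needed for the remaining 'unlabelled' documents this sniffer returns null for.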
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers