[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433 ]
Doğacan Güney commented on NUTCH-25:
------------------------------------

Doug, thanks for the (very) detailed feedback! This is incredibly helpful.

> I did find a small number of cases where high-ish (>50%) confidence detection
> was wrong:
> http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
> http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
> http://www.lafite.com/en/html/Corporate/1.html
> http://www.franz-keller.de/8860.html
> http://www.vinesnwines.org/?m=200605

Unfortunately, it seems there is not much we can do about these. I tried
calling detector.setDeclaredEncoding("UTF-8") before detection, and it didn't
help (the UTF-8 confidence is surprisingly low, around 25). I also tried
jchardet ( http://jchardet.sourceforge.net/ ) with these pages, and it doesn't
detect them as UTF-8 either, which is strange considering that Mozilla does
detect them correctly.

> Architecturally I think we should store the detected encoding AND the
> confidence in all cases (even when low), instead of storing it only when the
> confidence meets some threshold. That way the decision of which value to
> use can be made later, in the parser, which can make a "smart"
> decision based upon all the data that's available (detected, sniffed,
> reported, plus confidence value on detection). Then, for example, if there is
> no sniffed or reported value, we could use the detected value, even if the
> confidence is low (especially useful in the TextParser). We could also
> make decisions like "the confidence is medium, but the same value is both
> sniffed and reported, so let's trust that instead," which might fix some of
> the detection problem cases.

Good idea, but implementation-wise I would suggest that we rip out the
detection code from Content.java and move it into parse-html (and whatever
else wants to detect encoding).
There will be some code duplication, but this way parse-html can get all the
possible matches (via detector.detectAll) and then use the sniffed and
reported values to make a decision. For example, auto-detect may return
(ISO-8859-1, 60), (UTF-8, 25), while sniffed and reported may both be UTF-8.
Based on this data, parse-html can then choose UTF-8 (one idea would be to
have sniffed and reported each add a delta, say 20, to the confidence of the
matching candidate).

What do you think?

> needs 'character encoding' detector
> -----------------------------------
>
>          Key: NUTCH-25
>          URL: https://issues.apache.org/jira/browse/NUTCH-25
>      Project: Nutch
>   Issue Type: New Feature
>     Reporter: Stefan Groschupf
>     Assignee: Doğacan Güney
>      Fix For: 1.0.0
>
>  Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
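
For illustration, the delta idea from the comment above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the code in the attached patches: the class name EncodingChooser, the method choose, and the DELTA constant are hypothetical, and the candidate map stands in for whatever detector.detectAll would return, so that the example has no library dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Nutch or of the attached patches.
class EncodingChooser {
    // Assumed bonus added each time a declared charset agrees with a candidate.
    static final int DELTA = 20;

    // detected: candidate charset -> detector confidence (0-100).
    // sniffed:  charset from the meta tag, or null if absent.
    // reported: charset from the HTTP Content-Type header, or null if absent.
    static String choose(Map<String, Integer> detected, String sniffed, String reported) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : detected.entrySet()) {
            int score = e.getValue();
            // equalsIgnoreCase(null) is false, so missing values add nothing.
            if (e.getKey().equalsIgnoreCase(sniffed)) score += DELTA;
            if (e.getKey().equalsIgnoreCase(reported)) score += DELTA;
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        // With no detector candidates at all, fall back to a declared value.
        if (best == null) best = (sniffed != null) ? sniffed : reported;
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> detected = new LinkedHashMap<>();
        detected.put("ISO-8859-1", 60);
        detected.put("UTF-8", 25);
        // UTF-8 scores 25 + 20 + 20 = 65, which beats ISO-8859-1 at 60.
        System.out.println(choose(detected, "UTF-8", "UTF-8")); // prints UTF-8
    }
}
```

Note that with a delta of 20, a single agreeing source is not enough in this example (25 + 20 = 45 < 60); UTF-8 wins only because both the sniffed and the reported value agree. The actual delta would need tuning against real pages.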
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers