[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ]
Doug Cook commented on NUTCH-25:
--------------------------------

Not sure where this belongs architecturally and aesthetically -- will think about that.

The relevance test results look good -- overall at least as good as before. The histogram of confidence values from ICU4J on a ~60K-document test DB looks something like:

   0-9        6
  10-19     440
  20-29    2466
  30-39    7724
  40-49   11372
  50-59   10791
  60-69    9583
  70-79    4519
  80-89    4479
  90-99     386

I did find a small number of cases where fairly high-confidence (>50%) detection was wrong:

  http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
  http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
  http://www.lafite.com/en/html/Corporate/1.html
  http://www.franz-keller.de/8860.html
  http://www.vinesnwines.org/?m=200605

In all these cases, ICU4J guessed Latin-1, while the page was (correctly) reported(*) or sniffed(*) to be UTF-8. That said, overall ICU4J seems to perform quite well. In addition to the overall relevance tests, I searched for the word fragment "teau," which occurs frequently when the word Château is parsed with the wrong encoding (producing Ch + garbage + teau). Before the patch I saw 102 occurrences; afterwards I saw 69. Many of those 69 seemed to be on pages that had mixed encodings or typos, so they show up that way even in the browser. Also, many of the remaining pages were text files or RSS feeds (parsed by TextParser, which I haven't yet adapted to use the encoding detection; doing that now).

Architecturally, I think we should store the detected encoding AND the confidence in all cases (even when the confidence is low), instead of storing the encoding only when the confidence meets some threshold. That way the decision of which value to use can be made later, in the parser, which can make a "smart" decision based on all the data that's available (detected, sniffed, and reported values, plus the confidence of the detection).
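To illustrate, a minimal sketch of that idea in Java. The class and method names (EncodingClues, chooseEncoding) and the 50% threshold are purely illustrative assumptions, not existing Nutch API; the detected value and confidence would come from ICU4J's CharsetDetector/CharsetMatch.

```java
// Hypothetical sketch: store every clue (detected encoding + confidence,
// sniffed meta-tag value, reported HTTP header value) and let the parser
// decide later. Names and the 50% threshold are illustrative assumptions.
public class EncodingClues {
    String detected;   // e.g. from ICU4J's CharsetDetector
    int confidence;    // 0-100, as from CharsetMatch.getConfidence()
    String sniffed;    // from the page meta tags (may be null)
    String reported;   // from the HTTP Content-Type header (may be null)

    public EncodingClues(String detected, int confidence,
                         String sniffed, String reported) {
        this.detected = detected;
        this.confidence = confidence;
        this.sniffed = sniffed;
        this.reported = reported;
    }

    /** Pick an encoding using all available clues. */
    public String chooseEncoding() {
        // Sniffed and reported agree: trust them, even over a
        // medium-confidence detection.
        if (sniffed != null && sniffed.equalsIgnoreCase(reported)) {
            return sniffed;
        }
        // Otherwise a high-confidence detection wins.
        if (detected != null && confidence > 50) {
            return detected;
        }
        // Fall back to whatever was declared.
        if (sniffed != null) return sniffed;
        if (reported != null) return reported;
        // Nothing declared: use the detected value even at low
        // confidence (useful for TextParser, which has no meta tags).
        return detected;
    }
}
```

With this shape, the Latin-1 misdetections listed above would be fixed by the first rule whenever the header and meta tag both said UTF-8.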
Then, for example, if there is no sniffed or reported value, we could use the detected value even if the confidence is low (especially useful in the TextParser). We could also make decisions like "the confidence is medium, but the same value is both sniffed and reported, so let's trust that instead," which might fix some of the detection problem cases above. Hope this all makes sense. I'll keep plugging away at this today and report back on what I find. Thanks for all the help and the quick responses.

Doug

(*) By "reported," I mean specified in the HTTP header, and by "sniffed," I mean specified in the page meta tags (this being the terminology used in the code).

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> Submitted by: Jungshik Shin
>
> This is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' header field in the
> HTTP header and the corresponding meta tag in HTML documents (and in the
> case of XML, we have to use a similar but different kind of 'parsing'), in
> the wild there are a lot of documents without any information about the
> character encoding used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with these 'unlabelled'
> documents.
> Mozilla's character encoding detector is GPL/MPL'd, and we might be able
> to port it to Java. Unfortunately, it's not fool-proof. However, along
> with some other heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages:
> http://trainedmonkey.com/week/2004/26
>
> In addition to character encoding detection, we also need to detect the
> language of a document, which is even harder and should be a separate bug
> (although it's related).