[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515365 ]
Doğacan Güney commented on NUTCH-25:
------------------------------------

[snip snip]

> Internal to guessEncoding, we could certainly add the clue values if it
> turns out that helps us make a better guess.
> Combining clues prior to guessEncoding is throwing away information -- clues
> might be additive, but they might not (two highly correlated pieces of data
> won't be additive, and inversely correlated features will even be
> "subtractive"). [...]

This is what I was talking about. We could allow users to specify the
'additiveness' of clues, but that may make the API unnecessarily complex. I
think that, for now, just adding confidence values in guessEncoding should be
good enough (a sketch of this appears at the end of this comment).

> [...] Ideally someone could make a large-ish test set, judge the "real"
> encoding for all the examples, do the statistics, and find out how all the
> (detected encoding, header value, metatags) interact. A guessEncoding based
> on statistical modeling would be pretty sweet. When I was working for a
> certain large search company, this is how we would typically tackle a
> problem like this.

[snip snip]

This is one of the things that would benefit Nutch immensely. Unfortunately,
I don't think we have nearly enough resources for it :(.

> It's worth adding that CharsetDetector also detects languages, and a few
> examples I looked at seemed pretty good. It seems a shame to throw away
> that information, especially when I know Nutch's built-in language
> detection makes a fair number of mistakes (though in part because it trusts
> the page metatags, which are often wrong).

Another bit of food for thought. Sami Siren suggested this a while ago, but I
didn't see where icu4j does the language detection (sorry, Sami!). Can you
provide a link to icu4j's language detection?

I agree with you that most of the mistakes language detection makes come from
its 'trusting' nature. I would actually go a bit further and say this: any
code (at least for Nutch) that trusts input without validating it is
inherently wrong :). We are dealing with the Web here, and that's just the
way things are on the WWW. This includes, off the top of my head, encoding
detection, language detection, and content-type (MIME type) detection.

Btw, I forgot to say this in my previous comment, so here it is:

* The stripGarbage method won't work for non-byte-oriented encodings such as
UTF-16. UTF-16 uses at least two bytes per character, and the first or second
byte of a pair may be 0x3C ('<') even though the represented character is
something else entirely. Mozilla has some code for detecting byte orders
(there is a link somewhere in parse-html). I actually ported that code to
Java but never got around to testing it. If I can find the patch, it may be
useful to add it to EncodingDetector (a simpler BOM sniff is sketched at the
end of this comment).

* Also, I am not an expert on charsets, but I think that for all
byte-oriented encodings the first 128 characters (the ASCII range) are the
same. So you can 'cast' the given byte array to ASCII safely (I am not
suggesting that you should, just saying that it is doable; this cast is also
sketched below).
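To make the confidence-summing idea above concrete, here is a minimal
sketch. EncodingClue and its charset/confidence fields are hypothetical
names for illustration, not the API from the attached patch:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: sum per-charset confidences inside guessEncoding
    // and pick the charset with the highest total. EncodingClue, charset and
    // confidence are illustrative names, not the actual patch API.
    class EncodingClue {
      String charset;   // e.g. "UTF-8"
      int confidence;   // 0..100, higher means a more trusted clue
    }

    class GuessSketch {
      static String guessEncoding(List<EncodingClue> clues) {
        Map<String, Integer> scores = new HashMap<String, Integer>();
        for (EncodingClue clue : clues) {
          Integer old = scores.get(clue.charset);
          scores.put(clue.charset, (old == null ? 0 : old) + clue.confidence);
        }
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
          if (e.getValue() > bestScore) {
            best = e.getKey();
            bestScore = e.getValue();
          }
        }
        return best;  // null if there were no clues at all
      }
    }

Note that plain summing is exactly the 'additive' assumption the quoted text
warns about; two correlated clues (say, two meta tags saying the same thing)
would get counted twice.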
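On the icu4j question above: the CharsetMatch that CharsetDetector returns
carries a language guess next to the charset name, which is presumably what
the quoted comment refers to. A minimal sketch (getLanguage() can return
null when the match has no language information):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    // Sketch: ask icu4j for both the charset and the language guess.
    class LangSketch {
      public static void main(String[] args) throws Exception {
        byte[] content = "<html><body>bonjour tout le monde</body></html>"
            .getBytes("ISO-8859-1");
        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);
        CharsetMatch match = detector.detect();
        System.out.println("charset=" + match.getName()
            + " language=" + match.getLanguage()
            + " confidence=" + match.getConfidence());
      }
    }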
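On the byte-order point: this is not the ported Mozilla detector mentioned
above, just a minimal BOM sniff to show the check that would have to run
before any byte-oriented scan such as stripGarbage:

    // Minimal BOM sniff. A real detector (like Mozilla's) also guesses the
    // byte order of BOM-less UTF-16 from byte patterns; this only covers
    // the explicit byte-order marks.
    class BomSketch {
      static String sniffBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
          return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
          return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
          return "UTF-16LE";
        return null;  // no BOM; fall back to other clues
      }
    }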
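And the ASCII 'cast' mentioned in the last point, sketched for completeness;
it keeps only bytes in the shared 0x00-0x7F range:

    // Lossy 'cast' of raw bytes to ASCII. For ASCII-compatible encodings
    // the bytes 0x00-0x7F map to the same characters, so the result is
    // readable (if incomplete) text.
    class AsciiSketch {
      static String asciiCast(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        for (byte x : b) {
          if ((x & 0xFF) < 0x80) sb.append((char) x);
        }
        return sb.toString();
      }
    }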
> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch,
> NUTCH-25_draft.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
>
> this is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' field in the HTTP
> header and the corresponding meta tag in html documents (and in the case of
> XML, we have to use a similar but different 'parsing'), in the wild there
> are a lot of documents without any information about the character encoding
> used. Browsers like Mozilla and search engines like Google use character
> encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to
> port it to Java. Unfortunately, it's not fool-proof. However, along with
> some other heuristics used by Mozilla and elsewhere, it'll be possible to
> achieve a high detection rate.
> The following page has links to some other related pages:
> http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we also need to detect the
> language of a document, which is even harder and should be a separate bug
> (although it's related).

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.