[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ]
Doug Cook commented on NUTCH-25:
--------------------------------

Oops, spoke too soon. On running a more extensive test, I saw quite a few ArrayIndexOutOfBoundsException errors coming from ICU4J. Most were for index 0, some were not.

The index 0 ones seem explainable by passing in content that is too short (see: http://bugs.icu-project.org/trac/ticket/5596). This was easily fixed by skipping detection unless the content is longer than 4 bytes.

Then there were problems from non-zero indices. I don't understand why these happen, but in any case they should not cause the entire fetch to fail, so I added a try/catch around the call to ICU4J. Failures will now fall back to the previous methods (the response header or sniffing, as appropriate).

The new check follows. When this crawl finishes I will look for any more subtle errors in my relevance tests.

    String encoding = null;
    if (minConfidence >= 0 && DETECTABLES.contains(getContentType())
        && content.length > 4) {
      detector.enableInputFilter(true);
      detector.setText(content);
      CharsetMatch match = null;
      try {
        match = detector.detect();
      } catch (Exception e) {
        // ICU4J failed; leave match null so we fall back to the
        // response header or sniffing.
      }
      // match is null when detect() threw or found nothing, so guard
      // the trace call as well as the confidence check.
      if (match != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace("Detected: confidence=" + match.getConfidence());
        }
        if (match.getConfidence() >= minConfidence) {
          encoding = match.getName();
        }
      }
    }
    if (encoding != null) {
      metadata.set(Metadata.DETECTED_ENCODING, encoding);
    }

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>         Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
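
For anyone who wants to try the fallback behaviour from the comment above without ICU4J on the classpath, here is a minimal sketch. The names Detector, Guess and chooseEncoding are hypothetical stand-ins made up for illustration; they are not part of Nutch or ICU4J. The sketch only assumes what the comment describes: skip detection for content of 4 bytes or fewer, treat any detector exception as "no result", and otherwise accept the guess only if it meets the minimum confidence.

```java
class EncodingFallback {

    /** Hypothetical stand-in for a statistical detector such as ICU4J's. */
    public interface Detector {
        Guess detect(byte[] content);
    }

    /** Hypothetical result type: a charset name plus a confidence score. */
    public static final class Guess {
        public final String name;
        public final int confidence;

        public Guess(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the detected charset name, or the fallback (for example the
     * value from the response header) when detection is disabled, skipped,
     * fails, or is not confident enough.
     */
    public static String chooseEncoding(Detector detector, byte[] content,
                                        int minConfidence, String fallback) {
        // A negative threshold disables detection; very short content is
        // skipped because it triggered the index-0 errors described above.
        if (minConfidence < 0 || content == null || content.length <= 4) {
            return fallback;
        }
        Guess guess = null;
        try {
            guess = detector.detect(content);
        } catch (RuntimeException e) {
            // A detector failure must not abort the fetch; use the fallback.
        }
        if (guess != null && guess.confidence >= minConfidence) {
            return guess.name;
        }
        return fallback;
    }
}
```

A detector that throws ArrayIndexOutOfBoundsException then simply yields the fallback value, which is the behaviour the try/catch in the comment is meant to give.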