[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

Doug Cook (JIRA) Fri, 20 Jul 2007 19:11:37 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ]


Doug Cook commented on NUTCH-25:
--------------------------------

Oops, spoke too soon. On running a more extensive test, I saw quite a few 
ArrayIndexOutOfBoundsException errors coming from ICU4J. Most were for index 0; 
some were not. 

The index 0 ones seem to be caused by passing in content that is too short (see: 
http://bugs.icu-project.org/trac/ticket/5596). This was easily fixed by skipping 
detection for content of 4 bytes or fewer. The failures at non-zero indices I 
don't understand, but in any case they should not cause the entire fetch to 
fail, so I added a try/catch around the call to ICU4J; failures will now fall 
back to the previous methods (the response header or sniffing, as appropriate).

The new check follows. When this crawl finishes I will look for any more subtle 
errors in my relevance tests.

    String encoding = null;
    // Skip very short content, which triggers the index 0 failure in ICU4J.
    if (minConfidence >= 0 && DETECTABLES.contains(getContentType())
        && content.length > 4) {
      detector.enableInputFilter(true);
      detector.setText(content);
      CharsetMatch match = null;
      try {
        match = detector.detect();
      } catch (Exception e) {
        // Ignore detector failures: match stays null, so we fall back to
        // the response header or sniffing.
      }

      // match is null if detection failed, so guard before using it.
      if (match != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace("Detected: confidence=" + match.getConfidence());
        }
        if (match.getConfidence() >= minConfidence) {
          encoding = match.getName();
        }
      }
    }

    if (encoding != null) {
      metadata.set(Metadata.DETECTED_ENCODING, encoding);
    }
  }
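
For anyone who wants to try the guard-and-fall-back pattern in isolation, here 
is a minimal sketch that runs without ICU4J on the classpath. The names 
EncodingGuesser, Guess, EncodingFallbackSketch and chooseEncoding are all 
hypothetical stand-ins made up for illustration (EncodingGuesser plays the role 
of the ICU4J detector call, and fallbackEncoding stands for whatever the 
response header or sniffing would have produced); this is not the Nutch code.

```java
// Hypothetical stand-in for the ICU4J detector call. It may return null
// (no guess) or throw a runtime exception on input it cannot handle.
interface EncodingGuesser {
    Guess guess(byte[] content);
}

// Hypothetical value type: an encoding name plus a confidence score.
final class Guess {
    final String name;
    final int confidence;

    Guess(String name, int confidence) {
        this.name = name;
        this.confidence = confidence;
    }
}

final class EncodingFallbackSketch {
    // Mirrors the "content.length > 4" guard in the check above.
    static final int MIN_LENGTH = 4;

    /**
     * Returns the guessed encoding if detection ran, succeeded, and met
     * minConfidence; otherwise returns fallbackEncoding unchanged.
     */
    static String chooseEncoding(byte[] content, EncodingGuesser guesser,
                                 int minConfidence, String fallbackEncoding) {
        Guess guess = null;
        if (minConfidence >= 0 && content != null
                && content.length > MIN_LENGTH) {
            try {
                guess = guesser.guess(content);
            } catch (RuntimeException e) {
                // A detector failure must not abort the caller; guess stays
                // null and we fall through to the fallback.
            }
        }
        if (guess != null && guess.confidence >= minConfidence) {
            return guess.name;
        }
        return fallbackEncoding;
    }
}
```

The point of keeping the null check after the try/catch is that a failed 
detection and a low-confidence detection end up on the same path: both return 
the fallback, and neither can throw out of the method.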


> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>         Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
