[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ]

Doug Cook commented on NUTCH-25:
--------------------------------

As for the problem cases: I'm running a test now on my test DB (the ~60K-doc 
one). I'm going to take a random sample of the discrepancies between the 
detected/reported/sniffed values, determine the correct value for each, and see 
whether there is a heuristic that combines all three and does a little better 
than the detection on its own. Perhaps this is what Mozilla does.
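
For concreteness, the kind of combining heuristic I have in mind looks roughly 
like this (purely illustrative, and not a claim about what Mozilla actually 
does; the confidence threshold and the fallback order are guesses until the 
sample data says otherwise):

    // Strawman: trust the detector only when it is confident, otherwise
    // fall back to the sniffed (meta tag) value, then the reported (HTTP
    // Content-Type) value, then an arbitrary default.
    String choose(String detected, int confidence,
                  String sniffed, String reported) {
      if (detected != null && confidence >= 50) return detected;
      if (sniffed != null) return sniffed;
      if (reported != null) return reported;
      return "ISO-8859-1";
    }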

I'll also play with setDeclaredEncoding and see if that helps at all on the 
larger data set. (I didn't know there was one, thanks for pointing that out! 
That's what I get for not looking at the icu4j docs ;-)
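
For reference, the icu4j call sequence I'll be trying looks something like the 
sketch below ('content' holds the raw page bytes and 'httpCharset' whatever the 
Content-Type header claimed, possibly null):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    CharsetDetector detector = new CharsetDetector();
    detector.setText(content);                     // raw, undecoded bytes
    if (httpCharset != null)
      detector.setDeclaredEncoding(httpCharset);   // hint from the C-T header
    CharsetMatch match = detector.detect();
    String charset = match.getName();              // e.g. "UTF-8"
    int confidence = match.getConfidence();        // roughly 0-100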

I've also integrated detection into the TextParser and rewritten the choosing 
logic in HtmlParser (both using unsurprisingly similar code, which argues for 
the utility class you suggested). I'm testing those now.

It's not a bad idea to move detection out of the Content class; it could become 
part of the proposed utility class for charset detection. That class could then 
encapsulate (a) running charset detection and (b) choosing the most likely 
"correct" charset for a document from a number of inputs (detected, reported, 
etc., depending on content type). The code duplication across the different 
parsers would then be minimal; in fact, their current code might get shorter 
if we have the right abstraction.
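
A strawman shape for that class (the names are entirely up for debate, and 
chooseCharset() just repeats the illustrative heuristic from above until the 
test results justify something better):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    /** Hypothetical utility: one place for detection and choosing. */
    public class EncodingDetector {

      /** Runs icu4j detection over the raw bytes, seeded with the
       *  charset reported in the HTTP header (may be null). */
      public CharsetMatch detect(byte[] content, String reported) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);
        if (reported != null)
          detector.setDeclaredEncoding(reported);
        return detector.detect();
      }

      /** Chooses the most plausible charset from the available signals. */
      public String chooseCharset(CharsetMatch detected, String sniffed,
                                  String reported) {
        if (detected != null && detected.getConfidence() >= 50)
          return detected.getName();
        if (sniffed != null) return sniffed;
        return reported != null ? reported : "ISO-8859-1";
      }
    }

HtmlParser and TextParser would then each shrink to a detect() call plus a 
chooseCharset() call, with only their content-type-specific sniffing left in 
the parser itself.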

d

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the
> case of XML, we have to use a similar but different
> 'parsing'), in the wild there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
