[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ]

Doug Cook commented on NUTCH-25:
--------------------------------

Not sure where this belongs architecturally and aesthetically -- will think 
about that.

The relevance test results look good -- overall at least as good as before.

The histogram of confidence values from ICU4J on a ~60K doc test DB looks 
something like:
      0-9       6
    10-19     440
    20-29    2466
    30-39    7724
    40-49   11372
    50-59   10791
    60-69    9583
    70-79    4519
    80-89    4479
    90-99     386
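
For reference, these confidence numbers come straight from ICU4J's
CharsetDetector. Roughly, the measurement looks like this (a minimal
sketch, not the patch itself -- the classes and methods are ICU4J's,
the sample bytes are just a stand-in):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the raw bytes of a fetched page.
        byte[] content = "Château Margaux".getBytes("UTF-8");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);

        // detect() returns the best match; getConfidence() is the
        // 0-100 value binned in the histogram above.
        CharsetMatch match = detector.detect();
        System.out.println(match.getName() + " @ " + match.getConfidence() + "%");
    }
}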

I did find a small number of cases where high-ish (>50%) confidence detection 
was wrong:
http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
http://www.lafite.com/en/html/Corporate/1.html
http://www.franz-keller.de/8860.html
http://www.vinesnwines.org/?m=200605

In all of these cases, ICU4J guessed Latin-1, while the page was (correctly) 
reported(*) or sniffed(*) as UTF-8. That said, overall ICU4J seems to perform 
quite well. In addition to the overall relevance tests, I searched for the 
word fragment "teau," which occurs frequently when the word Château is parsed 
with the wrong encoding (producing Ch + garbage + teau). Prior to the patch I 
saw 102 occurrences; afterwards, 69. Many of these 69 were on pages with mixed 
encodings or typos, so the fragment shows up that way even in the browser. 
Many of the remaining pages were text files or RSS feeds (parsed by 
TextParser, which I haven't yet adapted to use the encoding detection; doing 
that now).
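
Incidentally, the mangling is easy to reproduce. A quick illustration (not 
code from the patch):

public class MojibakeSketch {
    public static void main(String[] args) throws Exception {
        // The UTF-8 bytes of "Château" decoded as Latin-1 yield the
        // Ch + garbage + teau pattern that the "teau" search hits on.
        byte[] utf8 = "Château".getBytes("UTF-8");
        String mangled = new String(utf8, "ISO-8859-1");
        System.out.println(mangled); // prints "ChÃ¢teau"
    }
}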

Architecturally, I think we should store the detected encoding AND the 
confidence in all cases (even when the confidence is low), instead of storing 
it only when the confidence meets some threshold. That way the decision of 
which value to use can be made later, in the parser, which can make a "smart" 
decision based on all the available data (detected, sniffed, and reported 
values, plus the detection confidence). Then, for example, if there is no 
sniffed or reported value, we could use the detected value even if the 
confidence is low (especially useful in the TextParser). We could also make 
decisions like "the confidence is medium, but the same value is both sniffed 
and reported, so let's trust that instead," which might fix some of the 
detection problem cases.
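
Something like the following is what I have in mind. A rough sketch only; 
the metadata key names and the 50% threshold are made up for illustration, 
not actual Nutch constants:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.util.Map;

public class EncodingChoiceSketch {

    // Fetch time: always record the detector's best guess and its
    // confidence, with no thresholding.
    static void recordDetection(byte[] content, Map<String, String> meta) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);
        CharsetMatch match = detector.detect();
        meta.put("charset.detected", match.getName());
        meta.put("charset.confidence", Integer.toString(match.getConfidence()));
    }

    // Parse time: combine the detected, sniffed, and reported values.
    // Assumes recordDetection() has already run on this page.
    static String chooseEncoding(Map<String, String> meta,
                                 String sniffed, String reported) {
        String detected = meta.get("charset.detected");
        int confidence = Integer.parseInt(meta.get("charset.confidence"));

        if (sniffed != null && sniffed.equalsIgnoreCase(reported)) {
            return sniffed;    // sniffed and reported agree: trust them
        } else if (confidence >= 50) {
            return detected;   // reasonably confident detection
        } else if (reported != null) {
            return reported;
        } else {
            return detected;   // last resort, even at low confidence
        }
    }
}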

Hope this all makes sense. I'll keep plugging away at this today and report 
back on what I find. Thanks for all the help and quick responses.

Doug

(*) By "reported," I mean in the HTTP header, and by "sniffed," I mean 
specified in the page metatags (since this is the term used in the code).

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the
> case of XML, we have to use similar but different
> 'parsing'), in the wild there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
