[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433
 ] 

Doğacan Güney commented on NUTCH-25:
------------------------------------

Doug, thanks for the (very) detailed feedback! This is incredibly helpful.

> I did find a small number of cases where high-ish (>50%) confidence detection 
> was wrong:
> http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
> http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
> http://www.lafite.com/en/html/Corporate/1.html
> http://www.franz-keller.de/8860.html
> http://www.vinesnwines.org/?m=200605 

Unfortunately, it seems there is not much we can do about these. I tried adding 
a detector.setDeclaredEncoding("UTF-8") call before detection and it didn't help 
(UTF-8 confidence is surprisingly low, around 25). I also tried jchardet ( 
http://jchardet.sourceforge.net/ ) with these pages and it doesn't detect them 
as UTF-8 either, which is strange considering that Mozilla does detect them 
correctly.
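If these pages really are valid UTF-8 (Mozilla's result suggests they are), then a purely structural check would accept them even when the statistical confidence is low. A minimal stdlib sketch of such a validity check (Utf8Check is a hypothetical helper for illustration, not part of the patch or of ICU):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/**
 * Illustrative only: a strict structural check, the kind of signal a
 * detector can combine with its statistical scores. A byte stream that
 * decodes cleanly as UTF-8 and contains multi-byte sequences is very
 * unlikely to be anything else.
 */
public class Utf8Check {
  public static boolean isValidUtf8(byte[] data) {
    try {
      // REPORT makes the decoder throw on any malformed sequence
      // instead of silently substituting replacement characters.
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

A detector that ran this check first could refuse to report a non-UTF-8 charset with high confidence for input that is structurally valid multi-byte UTF-8.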

> Architecturally I think we should store the detected encoding AND the 
> confidence in all cases (even when low), instead of storing it only when the 
> confidence meets some threshold. That way the decision of which value to 
> use can be made later, in the parser, which can make a "smart" 
> decision based upon all the data that's available (detected, sniffed, 
> reported, plus confidence value on detection). Then, for example, if there is 
> no sniffed or reported value, we could use the detected value, even if the 
> confidence is low (especially useful in the TextParser). We could also 
> make decisions like "the confidence is medium, but the same value is both 
> sniffed and reported, so let's trust that instead," which might fix some of 
> the detection problem cases. 

Good idea, but implementation-wise I would suggest that we rip the detection 
code out of Content.java and move it into parse-html (and whatever else wants to 
detect encodings). There will be some code duplication, but this way parse-html 
can get all the possible matches (via detector.detectAll) and then use the 
sniffed and reported values to make a decision. For example, auto-detect may 
return (ISO-8859-1, 60), (UTF-8, 25), while sniffed and reported may both be 
UTF-8. Based on this data, parse-html can then choose UTF-8 (one idea would be 
to have a sniffed or reported match add a confidence delta, say 20). What do 
you think?
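As a rough sketch of the decision logic parse-html could apply (all names here are illustrative, not the patch's API): take the candidates from detector.detectAll, add a fixed confidence delta to any candidate that the sniffed or reported encoding agrees with, and pick the highest adjusted score:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the "delta" idea from the comment above:
 * merge auto-detect candidates with the sniffed and reported encodings
 * by granting a confidence bonus to any candidate they agree with.
 */
public class EncodingChooser {

  static final int AGREEMENT_DELTA = 20; // the delta suggested above

  /**
   * candidates: detected charset -> confidence (e.g. from detectAll()).
   * sniffed/reported: may be null when absent.
   * Returns the charset with the highest adjusted confidence.
   */
  public static String choose(Map<String, Integer> candidates,
                              String sniffed, String reported) {
    String best = null;
    int bestScore = Integer.MIN_VALUE;
    for (Map.Entry<String, Integer> e : candidates.entrySet()) {
      int score = e.getValue();
      if (e.getKey().equalsIgnoreCase(sniffed))  score += AGREEMENT_DELTA;
      if (e.getKey().equalsIgnoreCase(reported)) score += AGREEMENT_DELTA;
      if (score > bestScore) {
        bestScore = score;
        best = e.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // The example from the comment: detectAll returns (ISO-8859-1, 60),
    // (UTF-8, 25); sniffed and reported are both UTF-8, so UTF-8 wins
    // with 25 + 20 + 20 = 65 over ISO-8859-1's 60.
    Map<String, Integer> detected = new LinkedHashMap<>();
    detected.put("ISO-8859-1", 60);
    detected.put("UTF-8", 25);
    System.out.println(choose(detected, "UTF-8", "UTF-8")); // prints UTF-8
  }
}
```

With no sniffed or reported value, the same code falls back to the raw top detection, which matches Doug's point about using the detected value even at low confidence when nothing else is available.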


> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-25.patch, NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

