[Nutch-dev] [jira] Updated: (NUTCH-25) needs 'character encoding' detector

JIRA Mon, 21 May 2007 13:48:54 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Doğacan Güney updated NUTCH-25:
-------------------------------

    Attachment: NUTCH-25_draft.patch

Well, something like this should work...

+ Adds a new configurable property, parser.charset.autodetect.min.confidence. 
Nutch sets the encoding to the detected encoding if the detection confidence 
is greater than this value. Auto-detection is disabled if the value is negative.

+ Adds charset auto-detection logic to Content.java. Uses icu4j (so you need to 
put icu4j's jar under lib to try this).

+ If auto-detection is confident enough, it stores the detected encoding in 
Content's Metadata. The parse-html plugin is updated to read this and set the 
encoding accordingly.

+ Uses some code from NUTCH-487 and NUTCH-369 (Thanks, Renaud Richardet and 
Marcin Okraszewski). There is a bug in the current parse-html code: if an HTML 
page specifies an encoding, Neko ignores the auto-detected encoding and assumes 
that the encoding specified in the page is correct.
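
The threshold rule described above can be sketched roughly as follows. This is 
not the code from the patch: the class CharsetChoice and the method choose are 
hypothetical names used only for illustration, and the sketch assumes the 
detector reports confidence as an integer from 0 to 100, as icu4j's 
CharsetMatch.getConfidence() does.

```java
/**
 * Illustrative sketch only (hypothetical names, not taken from the patch).
 * Decides whether a detected charset should replace the fallback encoding,
 * e.g. the one declared by the page or the HTTP header.
 */
final class CharsetChoice {

    private CharsetChoice() {}

    /**
     * @param detected      charset name reported by the detector, may be null
     * @param confidence    detector confidence, assumed to be 0..100
     * @param minConfidence value of parser.charset.autodetect.min.confidence;
     *                      a negative value disables auto-detection
     * @param fallback      encoding to use when detection is not trusted
     */
    static String choose(String detected, int confidence,
                         int minConfidence, String fallback) {
        // A negative threshold switches auto-detection off entirely.
        if (minConfidence < 0 || detected == null) {
            return fallback;
        }
        // The detected encoding wins only if confidence is strictly greater
        // than the configured minimum.
        return confidence > minConfidence ? detected : fallback;
    }

    public static void main(String[] args) {
        System.out.println(choose("windows-1254", 80, 50, "UTF-8"));
        System.out.println(choose("windows-1254", 99, -1, "UTF-8"));
    }
}
```

With a threshold of 50, a detection at confidence 80 is accepted, one at exactly 
50 is not, and any negative threshold keeps the fallback regardless of 
confidence.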

I didn't want to do auto-detection in parse-html because other plugins (like 
XML feed parsing plugins) may also need this. Also, IMHO, doing it in 
ParseSegment or ParseUtil wouldn't work, because I may not use those.

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>         Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'Content-Type'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

