[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ]

Doug Cook commented on NUTCH-25:
--------------------------------

Doğacan,

Thanks for the quick feedback.

> * EncodingDetector api is way too open. IMO, EncodingClue should be a
> private static class (users can pass a clue like detector.addClue(value,
> source, confidence)), EncodingDetector should not expose clues ever (for
> example, autoDetectClues should return void [or perhaps a boolean
> indicating the success of autodetect]) and store clues internally.

Good point. I had thought that callers might want to manipulate the list, but 
that's probably unlikely, and my current approach certainly leaves more room 
for caller screw-ups via the passed List. It's an easy fix to make, and it 
cleans up the calling code a little bit, too. I'll fix that.

If callers need to manipulate the list in the future, we can just add an 
interface for that.
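
For what it's worth, here's roughly the shape I have in mind after that change 
(just a sketch; the names and details are illustrative, not the actual patch):

    import java.util.ArrayList;
    import java.util.List;

    public class EncodingDetector {

      // Clue holder kept private; callers never see or touch the list directly.
      private static class EncodingClue {
        final String value;       // e.g. "utf-8"
        final String source;      // e.g. "header", "sniffed", "detect"
        final int confidence;     // negative = "use unconditionally if reached"

        EncodingClue(String value, String source, int confidence) {
          this.value = value;
          this.source = source;
          this.confidence = confidence;
        }
      }

      private final List clues = new ArrayList();  // raw type; I'm still on 1.4

      // Callers pass the pieces of a clue; the list stays internal.
      public void addClue(String value, String source, int confidence) {
        clues.add(new EncodingClue(value, source, confidence));
      }

      // Autodetection records its results internally and just reports success.
      public boolean autoDetectClues(byte[] data) {
        // ... run the charset detector over data, addClue(...) for each match ...
        return !clues.isEmpty();
      }
    }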

> * code: 
> 
>     public boolean meetsThreshold() { 
>       Integer mt = (Integer) thresholds.get(value); 
>       // use global value if no encoding-specific value found
>       int myThreshold = (mt != null) ? mt.intValue() : minConfidence; 
> 
>       return (confidence < 0 || (minConfidence >= 0 && confidence >= myThreshold)); 
>     } 
> 
> Why does meetsThreshold return true if confidence < 0? 

Negative confidence values have special semantics: they mean "use me if you get 
to me in the list, and ignore the threshold." These semantics are necessary to 
emulate the prior behavior (where, for example, header values would always be 
used, if present, in preference to 'sniffed' meta tags). Not that the prior 
behavior was perfect, but I think it's a useful construct: a value which, if 
present, should be used regardless of confidence thresholds.
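
Roughly, the idea inside guessEncoding is something like the following fragment 
(illustrative names only, not the actual patch code):

    // Walk the clues in priority order; a negative confidence short-circuits
    // the threshold check entirely.
    for (Iterator i = clues.iterator(); i.hasNext();) {
      EncodingClue clue = (EncodingClue) i.next();
      if (clue.confidence < 0) {
        return clue.value;        // e.g. an HTTP header value: use it unconditionally
      }
      if (clue.meetsThreshold()) {
        return clue.value;        // first clue that clears its threshold wins
      }
    }
    return defaultEncoding;       // nothing qualified; fall back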

> * If users specify an encoding clue with no confidence then we should give
> it a default positive confidence instead of -1. Of course, confidence value
> needs to be very very small, maybe just +1.

Hopefully this design choice makes more sense in light of the previous comment. 
The -1 has special semantics, meaning "I don't have a threshold."

> * It would be nice to "stack" clues. Assume that autodetection returned 2
> possible encodings: ISO-8859-1 with 50 confidence and UTF-8 with 45
> confidence. If I add a new clue (say, coming from http header) for UTF-8
> with +6 confidence, overall confidence for UTF-8 should now be 51.

I'm not sure whether you mean actually combining the clues in the list, or 
just adding the values together in guessEncoding.

Architecturally I think it's better to keep all the clues intact until the 
final "guess" is made. I've tried to make guessEncoding the place where all the 
policy decisions are made, the method to be overridden if someone has a better 
guessing algorithm. Within guessEncoding, we could certainly add the clue 
values together if it turns out that helps us make a better guess.
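
To sketch what I mean (purely hypothetical, not something in the patch), an 
alternative guessEncoding could sum confidences per encoding while the stored 
clues stay intact:

    // Hypothetical alternative policy inside guessEncoding(): sum the
    // confidences of all clues that name the same encoding, then pick the
    // encoding with the highest total.
    Map totals = new HashMap();          // encoding name -> summed confidence
    String best = defaultEncoding;
    int bestTotal = -1;
    for (Iterator i = clues.iterator(); i.hasNext();) {
      EncodingClue clue = (EncodingClue) i.next();
      Integer prev = (Integer) totals.get(clue.value);
      int total = (prev == null ? 0 : prev.intValue()) + Math.max(clue.confidence, 0);
      totals.put(clue.value, new Integer(total));
      if (total > bestTotal) {
        bestTotal = total;
        best = clue.value;
      }
    }
    return best;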

Combining clues prior to guessEncoding throws away information -- clues might 
be additive, but they might not be (two highly correlated pieces of data won't 
be additive, and inversely correlated features will even be "subtractive"). 
Ideally someone could build a large-ish test set, judge the "real" encoding for 
all the examples, do the statistics, and find out how all the signals (detected 
encoding, header value, meta tags) interact. A guessEncoding based on 
statistical modeling would be pretty sweet. When I was working for a certain 
large search company, this is how we would typically tackle a problem like 
this; I'm certain that's how CharsetDetector was created in the first place.

In the meantime, the simple algorithm provided seems to do reasonably well (it 
does very nearly what your version did, which seems like a fine place to start).

It's worth adding that CharsetDetector also detects languages, and a few 
examples I looked at seemed pretty good. It seems a shame to throw away that 
information, especially when I know Nutch's built-in language detection makes a 
fair number of mistakes (though in part because it trusts the page metatags, 
which are often wrong). Another bit of food for thought.
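
For reference, ICU's CharsetDetector (com.ibm.icu.text.CharsetDetector) already 
hands back a language guess alongside the charset; minimal usage looks something 
like this (rawBytes standing in for the fetched content):

    CharsetDetector detector = new CharsetDetector();
    detector.setText(rawBytes);                // rawBytes: the fetched page bytes
    CharsetMatch match = detector.detect();
    String charset = match.getName();          // e.g. "UTF-8"
    int confidence = match.getConfidence();    // 0-100
    String language = match.getLanguage();     // e.g. "ja"; may be null if unknown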

> * This is mostly my personal nit, but Java 5 style generics would be nice. 

Ah, you caught me. I'm still working in a 1.4-ish environment. 
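
For the record, once we can assume Java 5, the raw collections in the sketches 
above would just become the obvious generic versions, e.g.:

    private final List<EncodingClue> clues = new ArrayList<EncodingClue>();
    private final Map<String, Integer> thresholds = new HashMap<String, Integer>();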

> About contributing stuff back: [...]

Many thanks. This is pretty much what I'd assumed; unfortunately it will be a 
while before I have time and can afford the risk of bringing 0.9 changes into 
my local installation. But of course, the longer I wait, the more difficult the 
merge will be ;-) Oh well, I'll get there! There are a couple of important 
bugfixes that I'll try to make time to port sooner.

D

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch, 
> NUTCH-25_draft.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T' field in the HTTP
> header and the corresponding meta tag in HTML documents (and in the case of
> XML, we have to use similar but different 'parsing'), in the wild there are
> a lot of documents without any information about the character encoding
> used. Browsers like Mozilla and search engines like Google use character
> encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to
> port it to Java. Unfortunately, it's not fool-proof. However, along with
> some other heuristics used by Mozilla and elsewhere, it should be possible
> to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
