Jérôme Charron wrote:

I think this is an issue for all detection mechanisms...
The content-type has the same problem: which value is right, the one provided by the protocol layer, the one provided by the extension mapping, or the one provided by detection (mime-magic)?

I think we need to change the current process to make use of auto-detection mechanisms (this is true at least for the code that uses the language-identifier and the code that uses the mime-type identifier). Instead of doing something like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

We need to do something like:

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
3.1 If the value extracted from the protocol has a high degree of confidence, take the protocol value.
3.2 If the value extracted from parsing has a high degree of confidence, take the parsing value.
3.3 If neither has a high degree of confidence, but auto-detection returns another value with a high degree of confidence, take the auto-detection value.
3.4 Otherwise, take a default value.
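
Just to make the idea concrete, here is a rough sketch of step 3 in Java. Everything below is hypothetical (the class and method names and the 0.8 threshold are my assumptions, not existing Nutch code); the only thing taken from the discussion is that the detector returns per-language scores normalized to 0..1:

  import java.util.Map;

  public class LanguageSelector {

    private static final float HIGH_CONFIDENCE = 0.8f; // assumed threshold

    /**
     * Picks a value following steps 3.1-3.4 above. detectorScores maps
     * language codes to confidences normalized to 0..1.
     */
    public static String selectLanguage(String protocolLang, String parseLang,
        Map<String, Float> detectorScores, String defaultLang) {
      // 3.1 auto-detection confirms the protocol value with high confidence
      if (protocolLang != null
          && score(detectorScores, protocolLang) >= HIGH_CONFIDENCE) {
        return protocolLang;
      }
      // 3.2 auto-detection confirms the parsing value with high confidence
      if (parseLang != null
          && score(detectorScores, parseLang) >= HIGH_CONFIDENCE) {
        return parseLang;
      }
      // 3.3 neither is confirmed, but auto-detection is highly confident
      // about some other value
      String best = null;
      float bestScore = 0.0f;
      for (Map.Entry<String, Float> e : detectorScores.entrySet()) {
        if (e.getValue().floatValue() > bestScore) {
          best = e.getKey();
          bestScore = e.getValue().floatValue();
        }
      }
      if (best != null && bestScore >= HIGH_CONFIDENCE) {
        return best;
      }
      // 3.4 fall back to a default value
      return defaultLang;
    }

    private static float score(Map<String, Float> scores, String lang) {
      Float s = scores.get(lang);
      return (s == null) ? 0.0f : s.floatValue();
    }
  }

The same skeleton would work for the content-type case, with mime-magic scores instead of language scores.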

Yes, I agree.

* modify the identify() method to return a pair of lang code + relative
score (normalized to 0..1)


What do you think about returning a sorted array of lang/score pairs?

Yes, that would make sense too. I've been working with a proprietary language detection tool (based on similar principles), and it also returned a sorted array.
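
For illustration, the returned elements could look something like this (LangScore and sortDescending are hypothetical names, not existing code):

  import java.util.Arrays;
  import java.util.Comparator;

  public class LangScore {

    public final String lang;  // language code, e.g. "fr"
    public final float score;  // relative score, normalized to 0..1

    public LangScore(String lang, float score) {
      this.lang = lang;
      this.score = score;
    }

    /** Sorts candidates best-first, so callers can simply take result[0]. */
    public static LangScore[] sortDescending(LangScore[] candidates) {
      LangScore[] sorted = candidates.clone();
      Arrays.sort(sorted, new Comparator<LangScore>() {
        public int compare(LangScore a, LangScore b) {
          return Float.compare(b.score, a.score);
        }
      });
      return sorted;
    }
  }

identify() would then return a LangScore[] instead of a single code, best guess first, and callers that only want one answer can take result[0].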

For information, there are some other issues with the language-identifier:
I was focused on performance and precision, and now that I run it outside of the "lab" and perform some tests in real life, with real documents, I see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!! I discovered this issue and analyzed it yesterday: with UTF-8 encoded input documents you get some very fine identification, but with any other encoding it is a disaster.

Mhm. I'm not so sure. The NGramProfile load/save methods are safe; they both use UTF-8. LanguageIdentifier.identify() seems to be safe too, because it only works with Strings, which are not encoded (native Unicode). So the only place where it could be problematic seems to be the command-line utilities (the main methods in both classes), where a simple change to use InputStreamReader(inputStream, encoding) would fix the issue...
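
As a sketch, the fix in the main methods could look like this (the class name and argument handling are made up for the example; only the InputStreamReader idea comes from the point above):

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;

  public class IdentifyFile {

    public static void main(String[] args) throws IOException {
      String fileName = args[0];
      // Take the encoding from the command line instead of relying on
      // the platform default, which is where the breakage comes from.
      String encoding = (args.length > 1) ? args[1] : "UTF-8";
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new FileInputStream(fileName), encoding));
      StringBuilder text = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        text.append(line).append('\n');
      }
      in.close();
      // text.toString() can now be passed to LanguageIdentifier.identify(),
      // which is encoding-safe since it works on native Unicode Strings.
    }
  }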

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


