Jérôme Charron wrote:

I think this is an issue for all detection mechanisms...
The content-type has the same problem: which value is right, the one provided by the protocol layer, the one provided by the extension mapping, or the one provided by detection (mime-magic)?

I think we need to change the current process to make use of auto-detection mechanisms (this is true at least for the code that uses the language-identifier and the code that uses the mime-type identifier). Instead of doing something like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

We need to do something like:

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
3.1 If the value extracted from the protocol has a high degree of confidence, take the protocol value.
3.2 If the value extracted from parsing has a high degree of confidence, take the parsing value.
3.3 If neither has a high degree of confidence, but auto-detection returns another value with a high degree of confidence, take the auto-detection value.
3.4 Otherwise, take a default value.
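
Just to make the idea concrete, here is a rough sketch of step 3 in Java. Everything below is hypothetical (the class and method names and the 0.8 threshold are my assumptions, not existing Nutch code); the only thing taken from the discussion is that the detector returns per-language scores normalized to 0..1:

  import java.util.Map;

  public class LanguageSelector {

    private static final float HIGH_CONFIDENCE = 0.8f; // assumed threshold

    /**
     * Picks a value following steps 3.1-3.4 above. detectorScores maps
     * language codes to confidences normalized to 0..1.
     */
    public static String selectLanguage(String protocolLang, String parseLang,
        Map<String, Float> detectorScores, String defaultLang) {
      // 3.1 auto-detection confirms the protocol value with high confidence
      if (protocolLang != null
          && score(detectorScores, protocolLang) >= HIGH_CONFIDENCE) {
        return protocolLang;
      }
      // 3.2 auto-detection confirms the parsing value with high confidence
      if (parseLang != null
          && score(detectorScores, parseLang) >= HIGH_CONFIDENCE) {
        return parseLang;
      }
      // 3.3 neither is confirmed, but auto-detection is highly confident
      // about some other value
      String best = null;
      float bestScore = 0.0f;
      for (Map.Entry<String, Float> e : detectorScores.entrySet()) {
        if (e.getValue().floatValue() > bestScore) {
          best = e.getKey();
          bestScore = e.getValue().floatValue();
        }
      }
      if (best != null && bestScore >= HIGH_CONFIDENCE) {
        return best;
      }
      // 3.4 fall back to a default value
      return defaultLang;
    }

    private static float score(Map<String, Float> scores, String lang) {
      Float s = scores.get(lang);
      return (s == null) ? 0.0f : s.floatValue();
    }
  }

The same skeleton would work for the content-type case, with mime-magic scores instead of language scores.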

Yes, I agree.

* modify the identify() method to return a pair of lang code + relative
score (normalized to 0..1)


What do you think about returning a sorted array of lang/score pairs?

Yes, that would make sense too. I've been working with a proprietary language detection tool (based on similar principles), and it also returned a sorted array.
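
For illustration, the returned elements could look something like this (LangScore and sortDescending are hypothetical names, not existing code):

  import java.util.Arrays;
  import java.util.Comparator;

  public class LangScore {

    public final String lang;  // language code, e.g. "fr"
    public final float score;  // relative score, normalized to 0..1

    public LangScore(String lang, float score) {
      this.lang = lang;
      this.score = score;
    }

    /** Sorts candidates best-first, so callers can simply take result[0]. */
    public static LangScore[] sortDescending(LangScore[] candidates) {
      LangScore[] sorted = candidates.clone();
      Arrays.sort(sorted, new Comparator<LangScore>() {
        public int compare(LangScore a, LangScore b) {
          return Float.compare(b.score, a.score);
        }
      });
      return sorted;
    }
  }

identify() would then return a LangScore[] instead of a single code, best guess first, and callers that only want one answer can take result[0].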

For information, there are some other issues with the language-identifier:
I was focused on performance and precision, and now that I run it outside of the "lab" and perform some tests in real life, with real documents, I see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!! I discovered this issue and analyzed it yesterday: with UTF-8 encoded input documents you get some very fine identification, but with any other encoding it is a disaster.

Mhm. I'm not so sure. The NGramProfile load/save methods are safe; they both use UTF-8. LanguageIdentifier.identify() seems to be safe too, because it only works with Strings, which are not encoded (native Unicode). So the only place where it could be problematic seems to be the command-line utilities (the main methods in both classes), where a simple change to use InputStreamReader(inputStream, encoding) would fix the issue...
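
As a sketch, the fix in the main methods could look like this (the class name and argument handling are made up for the example; only the InputStreamReader idea comes from the point above):

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;

  public class IdentifyFile {

    public static void main(String[] args) throws IOException {
      String fileName = args[0];
      // Take the encoding from the command line instead of relying on
      // the platform default, which is where the breakage comes from.
      String encoding = (args.length > 1) ? args[1] : "UTF-8";
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new FileInputStream(fileName), encoding));
      StringBuilder text = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        text.append(line).append('\n');
      }
      in.close();
      // text.toString() can now be passed to LanguageIdentifier.identify(),
      // which is encoding-safe since it works on native Unicode Strings.
    }
  }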

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


