Jérôme Charron wrote:
I think this is an issue for all detection mechanisms...
For the content-type it is the same problem: what is the right value, the
one provided by the protocol layer, the one provided by the extension
mapping, or the one provided by the detection (mime-magic)?
I think we need to change the current process to use auto-detection
mechanisms (this is true at least for code that uses the language identifier
and code that uses the mime-type identifier). Instead of doing something
like:
1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection
We need to do something like:
1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
3.1 If the value extracted from the protocol has a high degree of
confidence, take the protocol value.
3.2 If the value extracted from parsing has a high degree of confidence,
take the parsing value.
3.3 If neither has a high degree of confidence, but auto-detection returns
another value with a high degree of confidence, take the auto-detection
value.
3.4 Otherwise, take a default value.
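The decision process above could be sketched roughly like this. This is only an illustration of the idea, not Nutch code: the Guess class, the resolve() method, and the CONFIDENCE_THRESHOLD cutoff are all hypothetical names.

```java
public class DetectionResolver {

    // Assumed cutoff for "high degree of confidence" (scores normalized to 0..1)
    static final float CONFIDENCE_THRESHOLD = 0.8f;

    /** A candidate value (language code or content-type) with its confidence. */
    static class Guess {
        final String value;
        final float confidence;
        Guess(String value, float confidence) {
            this.value = value;
            this.confidence = confidence;
        }
    }

    /**
     * Steps 1-2 gather the protocol and parsing values; step 3 uses the
     * auto-detection scores to pick a winner, falling back to a default.
     */
    static String resolve(Guess protocol, Guess parsing, Guess detected,
                          String defaultValue) {
        // 3.1 protocol value has a high degree of confidence
        if (protocol != null && protocol.confidence >= CONFIDENCE_THRESHOLD) {
            return protocol.value;
        }
        // 3.2 parsing value has a high degree of confidence
        if (parsing != null && parsing.confidence >= CONFIDENCE_THRESHOLD) {
            return parsing.value;
        }
        // 3.3 auto-detection proposes another value with high confidence
        if (detected != null && detected.confidence >= CONFIDENCE_THRESHOLD) {
            return detected.value;
        }
        // 3.4 default value
        return defaultValue;
    }

    public static void main(String[] args) {
        Guess protocol = new Guess("text/plain", 0.3f);
        Guess parsing  = new Guess("text/html", 0.4f);
        Guess detected = new Guess("application/pdf", 0.95f);
        System.out.println(resolve(protocol, parsing, detected, "text/plain"));
    }
}
```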
Yes, I agree.
* modify the identify() method to return a pair of lang code + relative
score (normalized to 0..1)
What do you think about returning a sorted array of lang/score pair?
Yes, that would make sense too. I've been working with a proprietary
language detection tool (based on similar principles), and it also
returned a sorted array.
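Such a sorted-array variant of identify() might look like the sketch below. This is purely illustrative, not the actual LanguageIdentifier API (which returns a single String); the LangScore class and the assumption that raw scores are similarities (higher = better) are mine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ScoredIdentifier {

    /** A language code paired with a score normalized to 0..1. */
    static class LangScore {
        final String lang;
        final float score;
        LangScore(String lang, float score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /**
     * Turns raw per-language similarity scores (assumed: higher = better)
     * into an array sorted best-first, with scores normalized to 0..1.
     */
    static LangScore[] identify(Map<String, Float> rawScores) {
        float max = 0f;
        for (float s : rawScores.values()) {
            max = Math.max(max, s);
        }
        List<LangScore> result = new ArrayList<>();
        for (Map.Entry<String, Float> e : rawScores.entrySet()) {
            float normalized = (max == 0f) ? 0f : e.getValue() / max;
            result.add(new LangScore(e.getKey(), normalized));
        }
        // best candidate first
        result.sort(Comparator.comparingDouble((LangScore ls) -> ls.score).reversed());
        return result.toArray(new LangScore[0]);
    }
}
```

The caller can then read the top candidate from index 0, but also inspect the gap to the runner-up as a crude confidence measure.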
For information, there are some other issues with the language identifier:
I was focused on performance and precision, and now that I run it outside
of the "lab" and perform some tests in real life, with real documents, I
see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
I discovered this issue and analyzed it yesterday: with UTF-8 encoded input
documents you get very good identification, but with other encodings
it is a disaster.
Mhm. I'm not so sure. The NGramProfile load/save methods are safe; they
both use UTF-8. LanguageIdentifier.identify() seems to be safe, too,
because it only works with Strings, which are not encoded (native
Unicode). So, the only place where it could be problematic seems to be
the command-line utilities (the main methods in both classes), where a
simple change to use InputStreamReader(inputStream, encoding) would fix
the issue...
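The suggested fix amounts to something like the following sketch (the readAll() helper and the sample file contents are illustrative): decode the bytes with an explicit charset so the identifier always sees proper Unicode Strings, instead of relying on the platform default (or assuming UTF-8).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ReadWithEncoding {

    /** Reads the whole stream into a String using the given encoding. */
    public static String readAll(InputStream in, String encoding)
            throws IOException {
        // Decode bytes with the declared encoding, not the platform default
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, encoding));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate an ISO-8859-1 (Latin-1) document on disk
        byte[] latin1 = "café".getBytes("ISO-8859-1");
        // Decoding these bytes as UTF-8 would garble the text;
        // passing the actual encoding recovers it correctly.
        String text = readAll(new ByteArrayInputStream(latin1), "ISO-8859-1");
        System.out.println(text); // prints "café"
    }
}
```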
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers