> I have an issue with the language detection plugin, which I'm not sure
> how to address. The plugin first tries to extract the language
> identifier from meta tags. However, the meta tag values people put there
> are often completely wrong, or follow obscure pseudo-standards.
>
> Example: there is a bunch of pages, generated by Frontpage, where the
> author apparently forgot to change the default settings. So, the meta
> tags say "en-us", while the real content of the page is in Spanish. The
> identify() method shows this clearly.
> The final value put in X-meta-lang is "en-us". Now, the question is:
> should the plugin override that value with the one from the
> auto-detection? This means that it should always run the detection
> step... Can we have more confidence in our detection mechanism than in
> the author's knowledge? Well, perhaps, if for content longer than xxx
> bytes the detection is nearly unambiguous.

I think this is an issue for all detection mechanisms... For the
content-type it is the same problem: what is the right value, the one
provided by the protocol layer, the one provided by the extension
mapping, or the one provided by the detection (mime-magic)?

I think we need to change the actual process to use the auto-detection
mechanisms (this is true at least for the code that uses the language
identifier and the code that uses the mime-type identifier). Instead of
doing something like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

we need to do something like this (see the first sketch below):

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
   3.1 The value extracted from the protocol has a high degree of
       confidence: take the protocol value.
   3.2 The value extracted from parsing has a high degree of
       confidence: take the parsing value.
   3.3 Neither has a high degree of confidence, but the auto-detection
       returns another value with a high degree of confidence: take the
       auto-detection value.
   3.4 Otherwise, take a default value.

> Another example: for a bunch of pages in Swedish, I collected the
> following values of X-meta-lang:
>
> (SCHEME=ISO.639-1) sv
> (SCHEME=ISO639-1) sv
> (SCHEME=RFC1766) sv-FI
> (SCHEME=Z39.53) SWE
> EN_US, SV, EN, EN_UK
> English Swedish
> English, swedish
> English,Swedish
> Other (Svenska)
> SE
> SV
> SV charset=iso-8859-1
> SV-FI
> SV; charset=iso-8859-1
> SVE
> SW
> SWE
> SWEDISH
> Sv
> Sve
> Svenska
> Swedish
> Swedish, svenska
> en, sv
> se
> se, en
> se,en,de
> se-sv
> sv
> sv, be, dk, de, fr, no, pt, ch, fi, en
> sv, dk, fi, gl, is, fo
> sv, dk, no
> sv, en
> sv, eng
> sv, eng, de
> sv, fr, eng
> sv, nl
> sv, no, de
> sv, no, en, de, dk, fi
> sv,en
> sv,en,de,fr
> sv,eng
> sv,eng,de,fr
> sv,no,fi
> sv-FI
> sv-SE
> sv-en
> sv-fi
> sv-se
> sv; Content-Language: sv
> sv_SE
> sve
> svenska
> svenska, swedish, engelska, english, norsk, norwegian, polska, polish
> sw
> swe
> swe.SPR.
> sweden
> swedish
> swedish,
> text/html; charset=sv-SE
> text/html; sv
> torp, stuga, uthyres, bed & breakfast
>
> In all cases the value from the detection routine was unambiguous:
> swedish.

Yes, I recently saw this problem while analyzing my indexes... A first
step could be to improve the Content-Language / dc.language / html lang
parsers. It could be done in the HTMLLanguageParser (a normalization
sketch follows below).

> In this light, I propose the following changes:
>
> * modify the identify() method to return a pair of lang code + relative
>   score (normalized to 0..1)

What do you think about returning a sorted array of lang/score pairs
instead? (See the LanguageScore sketch below.)

> * in HTMLLanguageParser we should always run
>   LanguageIdentifier.identify(parse.getText())

Yes!
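To make the new process concrete, here is a minimal sketch of the
selection in steps 3.1 to 3.4. All names and the 0.9 threshold are mine
(hypothetical), not the actual plugin API; the confidences are assumed to
come from the auto-detection step, scoring each candidate value:

// Minimal sketch of the confidence-based selection described above.
// Class, method and threshold are illustrative, not actual plugin code.
public class DetectionResolver {

  // Arbitrary threshold above which a value is considered trustworthy.
  private static final double HIGH_CONFIDENCE = 0.9;

  public String resolve(String protocolValue, double protocolConfidence,
                        String parsedValue, double parsedConfidence,
                        String detectedValue, double detectedConfidence,
                        String defaultValue) {
    // 3.1 The detection confirms the protocol value with high confidence.
    if (protocolValue != null && protocolConfidence >= HIGH_CONFIDENCE) {
      return protocolValue;
    }
    // 3.2 The detection confirms the parsed value with high confidence.
    if (parsedValue != null && parsedConfidence >= HIGH_CONFIDENCE) {
      return parsedValue;
    }
    // 3.3 Neither is confirmed, but the detection itself is highly
    //     confident about another value.
    if (detectedValue != null && detectedConfidence >= HIGH_CONFIDENCE) {
      return detectedValue;
    }
    // 3.4 Nothing trustworthy: fall back to a default.
    return defaultValue;
  }
}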
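About improving the parsers: a first, purely illustrative idea is to
normalize the declared values before trusting them. The class below is
hypothetical, and the alias table is just a tiny sample built from the
Swedish list above, not a complete mapping:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a normalizer for declared language values.
public class LanguageCodeNormalizer {

  private static final Map<String, String> ALIASES =
      new HashMap<String, String>();
  static {
    ALIASES.put("swe", "sv");
    ALIASES.put("sve", "sv");
    ALIASES.put("sw", "sv");
    ALIASES.put("se", "sv");      // country code misused as language code
    ALIASES.put("svenska", "sv");
    ALIASES.put("swedish", "sv");
    ALIASES.put("sweden", "sv");
  }

  /** Reduces a raw declared value to a bare ISO 639-1 code, or null. */
  public static String normalize(String raw) {
    if (raw == null) {
      return null;
    }
    // Keep only the first entry of a comma/semicolon separated list and
    // drop charset suffixes and region subtags ("sv-FI" -> "sv").
    // Values like "(SCHEME=...) sv" would need extra handling, omitted here.
    String s = raw.trim().toLowerCase();
    s = s.split("[,;]")[0].trim();
    s = s.split("[\\s_\\-]")[0];
    if (s.length() == 0) {
      return null;
    }
    String alias = ALIASES.get(s);
    if (alias != null) {
      return alias;
    }
    // Keep only plausible two-letter codes; everything else is junk that
    // the auto-detection will have to resolve anyway.
    return s.matches("[a-z]{2}") ? s : null;
  }
}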
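And about the identify() return value, the sorted array I have in mind
could look roughly like this (hypothetical names again; LanguageScore is
not existing code):

// Hypothetical sketch: identify() would return a LanguageScore[] sorted
// best-first instead of a single code.
public class LanguageScore implements Comparable<LanguageScore> {

  public final String lang;  // language code, e.g. "sv"
  public final float score;  // relative score, normalized to 0..1

  public LanguageScore(String lang, float score) {
    this.lang = lang;
    this.score = score;
  }

  // Descending order, so Arrays.sort() puts the best guess first.
  public int compareTo(LanguageScore other) {
    return Float.compare(other.score, this.score);
  }
}

HTMLLanguageParser could then always call the identifier and keep the
whole ranked list for the arbitration step described earlier, along the
lines of (assumed usage, not actual code):

// Always run the identifier on the parsed text, whatever the tags say.
LanguageScore[] guesses = identifier.identify(parse.getText());
String detected = (guesses.length > 0) ? guesses[0].lang : null;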
For information, there are some other issues with the language
identifier: I was focused on performance and precision, and now that I
run it outside of the "lab" and perform some tests in real life, with
real documents, I see a very big issue: the LanguageIdentifierPlugin is
UTF-8 oriented!!!

I discovered this issue and analyzed it yesterday: with UTF-8 encoded
input documents you get some very fine identification, but with any
other encoding it is a disaster. Sami (I think you were the original and
first coder of the LanguageIdentifierPlugin), do you already know about
this problem? Do you have some ideas about solving it? Actually, it is a
very big issue, and the language identifier cannot be used on a real
crawl as it stands (a first, speculative idea is sketched at the end of
this mail).

Thanks Andrzej for your feedback and ideas. (I will continue to focus my
work on the encoding problem, but once I can commit, I will implement
the changes you suggest in this mail.)

In fact, there are still a lot of TODOs in the language identifier: the
more I work on it, the more issues I see to fix, but it is a very
important module if we want to add multi-lingual support to Nutch. So, I
will update the Wiki pages about the language identifier in order to
keep track of all these fixes/ideas/issues...
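For the record, my first (speculative, untested) idea for the encoding
problem: never let the identifier see raw bytes, and decode the content
with its declared or detected charset first, so that non-UTF-8 documents
arrive as proper Java characters instead of skewing the character
statistics the identifier relies on. A hedged sketch, with illustrative
names only:

import java.io.UnsupportedEncodingException;

// Illustrative helper, not actual plugin code: decode raw content with
// its declared charset before calling LanguageIdentifier.identify().
public class EncodingHelper {

  public static String toUnicode(byte[] rawContent, String declaredCharset)
      throws UnsupportedEncodingException {
    // Fall back to ISO-8859-1 (or a real charset detector) when the
    // document declares nothing.
    String charset =
        (declaredCharset != null) ? declaredCharset : "ISO-8859-1";
    return new String(rawContent, charset);
  }
}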
Best Regards

Jerome

--
http://motrech.free.fr/
http://www.frutch.org/