> One thing that can be done is to move the n-gram language detection calls to the HTMLLanguageParser (an HtmlParseFilter plugin). After getting the results of the language detection, set a parse metadata field. Modify the LanguageIdentifier IndexFilter plugin to look for this metadata field instead of running n-gram language detection.

This is not practical, because content comes from other parser plugins as well, so this code fragment would have to be added to all plugins...
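
To make the trade-off concrete, here is a toy sketch of the quoted proposal: detect once at parse time, store the result in parse metadata, and let the index step reuse it. This is not the Nutch plugin API. The field name `lang.detected`, the function names, and the stop-word "detector" are all hypothetical stand-ins. The fallback in `index_filter` shows the problem raised above: content that arrived through any other parser never gets the field set.

```python
# Toy model only: none of these names come from Nutch.
LANG_KEY = "lang.detected"  # hypothetical parse-metadata field name

# Hypothetical stand-in for the n-gram detector: a tiny stop-word check.
SWEDISH_STOP_WORDS = {"och", "att", "som"}


def detect_language(text):
    """Return "sv" if any Swedish stop word occurs, otherwise "en"."""
    words = text.lower().split()
    return "sv" if any(w in SWEDISH_STOP_WORDS for w in words) else "en"


def parse_filter(text, parse_meta):
    """Parse-time step: run detection once and record it in the metadata."""
    parse_meta[LANG_KEY] = detect_language(text)
    return parse_meta


def index_filter(text, parse_meta):
    """Index-time step: reuse the stored value, or detect if it is missing."""
    lang = parse_meta.get(LANG_KEY)
    if lang is None:
        # Content from a parser that never ran parse_filter ends up here.
        lang = detect_language(text)
    return {"lang": lang}


if __name__ == "__main__":
    html_meta = parse_filter("Det finns en sida och den ligger i Sverige", {})
    print(index_filter("", html_meta))            # uses the stored field
    print(index_filter("plain english text", {})) # no field: detects again
```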
In my experience the meta tags in HTML files are not to be trusted. I would even go as far as to say that it is actually better when they are missing. In many cases the language tag is either wrong (it says "en" because the page was prepared with an English version of the application, while the page content is non-English), or so non-standard as to be useless. I base my observations on a corpus of ~20 million web pages, so I think I'm not that far from the truth... Here's an excerpt from the "lang" metadata field in that collection; all of these pages are actually written in Swedish:
(SCHEME=ISO.639-1) sv (SCHEME=ISO639-1) sv (SCHEME=RFC1766) sv-FI (SCHEME=Z39.53) SWE EN_US, SV, EN, EN_UK English Swedish English, swedish English,Swedish Other (Svenska) SE SV SV charset=iso-8859-1 SV-FI SV; charset=iso-8859-1 SVE SW SWE SWEDISH Sv Sve Svenska Swedish Swedish, svenska en, sv se se, en se,en,de se-sv sv sv, be, dk, de, fr, no, pt, ch, fi, en sv, dk, fi, gl, is, fo sv, dk, no sv, en sv, eng sv, eng, de sv, fr, eng sv, nl sv, no, de sv, no, en, de, dk, fi sv,en sv,en,de,fr sv,eng sv,eng,de,fr sv,no,fi sv-FI sv-SE sv-en sv-fi sv-se sv; Content-Language: sv sv_SE sve svenska svenska, swedish, engelska, english, norsk, norwegian, polska, polish sw swe swe.SPR. sweden swedish swedish, text/html; charset=sv-SE text/html; sv torp, stuga, uthyres, bed & breakfast ...
So, I think that it is much better to run a language detection plugin than to trust the meta tags. The same principle applies to other meta tags, which are nowadays used almost exclusively for manipulating search engine results.
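
As a rough illustration of why the declared values are hard to use, the sketch below tries to map a few values taken from the excerpt above to a two-letter code. The alias table and the `normalize` function are assumptions made for this example, not part of any Nutch plugin. Even with a hand-written alias table, values such as "se" (a country code rather than a language code) or "text/html; charset=sv-SE" cannot be resolved safely.

```python
import re

# Hypothetical alias table, written by hand for this example only.
KNOWN_ALIASES = {
    "sv": "sv", "swe": "sv", "sve": "sv", "swedish": "sv", "svenska": "sv",
    "en": "en", "eng": "en", "english": "en",
}

# Accept a single tag such as "sv", "sv-SE" or "sv_SE"; reject anything else.
SINGLE_TAG = re.compile(r"([a-z]+)(?:[-_][a-z]+)?")


def normalize(value):
    """Return a two-letter code for a declared value, or None if unusable."""
    match = SINGLE_TAG.fullmatch(value.strip().lower())
    if match is None:
        return None  # lists, charsets, scheme prefixes, keywords, ...
    return KNOWN_ALIASES.get(match.group(1))


if __name__ == "__main__":
    # A few values copied from the excerpt above.
    samples = ["sv", "sv-SE", "sv_SE", "SWE", "Svenska", "se", "sw",
               "EN_US, SV, EN, EN_UK", "text/html; charset=sv-SE",
               "(SCHEME=RFC1766) sv-FI"]
    for value in samples:
        print(repr(value), "->", normalize(value))
```

Loosening the pattern to accept more of these values would also start accepting the wrong ones (for example, reading the first entry of "EN_US, SV, EN, EN_UK" as English), which is the reason to prefer detection on the content itself.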
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
