Andy Liu wrote:
One thing that can be done is to move the n-gram language detection
calls to the HTMLLanguageParser (an HtmlParseFilter plugin).  After

This is not practical, because content comes from other parser plugins as well, so this code fragment would have to be added to all plugins...


getting the results of the language detection, set a parse metadata
field.  Modify LanguageIdentifier IndexFilter plugin to look for this
metadata field instead of running n-gram language detection.

In my experience the meta tags in HTML files are not to be trusted. I would even go as far as to say that it is actually better when they are missing. In many cases the language tag is either wrong (it says "en" because the page was prepared with an English version of the application, while the page content is not English), or so non-standard as to be useless. I base my observations on a corpus of ~20 million web pages, so I think I'm not that far from the truth. Here's an excerpt from the "lang" metadata field in that collection; all of these pages are actually written in Swedish:


(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast
...


So, I think that it is much better to run a language detection plugin than to trust the meta tags. The same principle applies to other meta tags, which are nowadays used almost exclusively for manipulating search engine results.
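
To illustrate why the values listed above are hard to use, here is a minimal sketch of a strict check applied to a raw "lang" value. It is not Nutch code: the class LangTagCheck and the method normalize are hypothetical names, and the rule it applies (accept only a bare two-letter code with an optional two-letter region, then look the code up in the JDK's ISO 639 list) is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides whether a raw "lang"
// meta value is clean enough to be read as a two-letter language code.
public class LangTagCheck {

    // Assumed rule: a two-letter code, optionally followed by '-' or '_'
    // and a two-letter region, e.g. "sv", "sv-SE", "sv_SE".
    private static final Pattern SIMPLE_TAG =
        Pattern.compile("^([A-Za-z]{2})(?:[-_][A-Za-z]{2})?$");

    // Two-letter language codes known to the JDK.
    private static final Set<String> ISO_639_1 =
        new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    // Returns the lower-cased language code, or empty if the value is rejected.
    public static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher m = SIMPLE_TAG.matcher(raw.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String code = m.group(1).toLowerCase(Locale.ROOT);
        return ISO_639_1.contains(code) ? Optional.of(code) : Optional.empty();
    }

    public static void main(String[] args) {
        // A few values taken from the excerpt above.
        String[] samples = {
            "sv", "SV-FI", "sv_SE", "SE", "sw", "Svenska",
            "sv, en", "text/html; sv", "(SCHEME=ISO.639-1) sv"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + normalize(s).orElse("(rejected)"));
        }
    }
}
```

Under these assumptions most of the excerpt is simply rejected. The more troublesome values are the ones that look clean but mean something else: "se" and "sw" are the ISO 639-1 codes for Northern Sami and Swahili, so a check like this would likely pass them as a language other than Swedish, and only looking at the page content can catch that.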


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


