Jerome,
I have an issue with the language detection plugin, which I'm not sure
how to address. The plugin first tries to extract the language
identifier from meta tags. However, meta tag values people put there are
often completely wrong, or follow obscure pseudo-standards.
Example: there are a number of pages generated by FrontPage where the
author apparently forgot to change the default settings. The meta tags
say "en-us", while the real content of the page is in Spanish. The
identify() method shows this clearly.
The final value put in X-meta-lang is "en-us". Now, the question is:
should the plugin override that value with the one from the
auto-detection? This means that it would always have to run the
detection step... Can we have more confidence in our detection
mechanism than in the author's knowledge? Perhaps, provided that the
detection is nearly unambiguous for content longer than xxx bytes.
Another example: for a bunch of pages in Swedish, I collected the
following values of X-meta-lang:
(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast
In all cases the value from the detection routine was unambiguous: Swedish.
In this light, I propose the following changes:
* modify the identify() method to return a pair of lang code + relative
  score (normalized to 0..1)
* in HTMLLanguageParser we should always run
  LanguageIdentifier.identify(parse.getText())
* if the meta tag is null, we take the value from identify()
* if the value from identify() is null, we take the meta tag value
* if the meta tag is not null and the value from identify() is not null:
    * if the content is shorter than "lang.analyze.max.length",
      we take the meta tag value
    * else, if the meta tag and identify() values are different:
        * if the score from identify() is above the "certainty"
          threshold (0.8?), we take the value from identify()
        * else, we take the meta tag value
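
To make the decision rules above concrete, here is a rough sketch in
Java. It is not the plugin's code: the class LanguageChoice, the
Detection pair and the choose() method are hypothetical names made up
for illustration, the length and certainty thresholds are passed in as
plain parameters rather than read from the configuration, and comparing
the two values case-insensitively is my assumption about what
"different" should mean.

```java
// Hypothetical sketch of the proposed rules; none of these names exist in Nutch.
final class LanguageChoice {

    /** Assumed result of a modified identify(): lang code + score in 0..1. */
    static final class Detection {
        final String lang;
        final double score;

        Detection(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /**
     * Picks between the meta tag value and the detected value.
     * minLength stands in for the length setting mentioned in the proposal,
     * certainty for the score threshold (0.8 is only a suggestion).
     */
    static String choose(String metaLang, Detection detected,
                         int contentLength, int minLength, double certainty) {
        String detectedLang = (detected == null) ? null : detected.lang;

        // Only one source available: use whichever is present (may be null).
        if (metaLang == null) {
            return detectedLang;
        }
        if (detectedLang == null) {
            return metaLang;
        }

        // Too little text to trust the detection: keep the author's value.
        if (contentLength < minLength) {
            return metaLang;
        }

        // Values disagree and the detection is confident: override the meta tag.
        // Case-insensitive comparison is an assumption of this sketch.
        if (!metaLang.equalsIgnoreCase(detectedLang) && detected.score > certainty) {
            return detectedLang;
        }
        return metaLang;
    }

    public static void main(String[] args) {
        // FrontPage-style page: meta says en-us, long Spanish content.
        System.out.println(choose("en-us", new Detection("es", 0.95), 5000, 1000, 0.8));
        // Same page, but too short to trust the detection.
        System.out.println(choose("en-us", new Detection("es", 0.95), 200, 1000, 0.8));
        // No meta tag at all.
        System.out.println(choose(null, new Detection("sv", 0.60), 5000, 1000, 0.8));
    }
}
```

With the sample values in main(), the three calls yield "es", "en-us"
and "sv" respectively. Note that a value such as "SV; charset=iso-8859-1"
would still count as different from "sv" here, so the result would then
depend on the score alone.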
Similar changes would be needed in LanguageIndexingFilter.filter(), to
handle text coming from other content types.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers