[Nutch-dev] Re: LanguageIdentifier refactoring

Andrzej Bialecki Tue, 05 Jul 2005 06:04:30 -0700

Jerome,

I have an issue with the language detection plugin, which I'm not sure how to address. The plugin first tries to extract the language identifier from meta tags. However, the meta tag values people put there are often completely wrong, or follow obscure pseudo-standards.

Example: there is a bunch of pages, generated by Frontpage, where the author apparently forgot to change the default settings. So, the meta tags say "en-us", while the real content of the page is in Spanish. The identify() method shows this clearly.

The final value put in X-meta-lang is "en-us". Now, the question is - should the plugin override that value with the one from the auto-detection? This means that it should always run the detection step... Can we have more confidence in our detection mechanism than in the author's knowledge? Well, perhaps, if for content longer than xxx bytes the detection is nearly unambiguous.

Another example: for a bunch of pages in Swedish, I collected the following values of X-meta-lang:


(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast


In all cases the value from the detection routine was unambiguous - Swedish.

In this light, I propose the following changes:

* modify the identify() method to return a pair of lang code + relative score (normalized to 0..1)

* in HTMLLanguageParser we should always run LanguageIdentifier.identify(parse.getText())


* if the meta tag is null, we take the value from identify()

* if the value from identify() is null, we take the meta tag value.

* if the meta tag is not null and the value from identify() is not null:

        * if the content is shorter than "lang.analyze.max.length",
          we take the meta tag value

        * else, if the meta tag and identify values are different:

                * if the score from identify() is above "certainty"
                  threshold (0.8?), we take the value from identify().

                * else, we take the meta tag value.
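
To make the rules above concrete, here is a rough sketch of the selection logic in Java. It is only an illustration of the proposal, not existing Nutch code: the class LangDecision, the Detected pair and the choose() method are hypothetical names, the length threshold stands for the "lang.analyze.max.length" setting, and the certainty threshold is passed in because 0.8 is only a tentative value. It also assumes the meta tag value can be compared directly with the detected code, which would need some normalization in practice.

```java
/** Sketch of the meta-tag vs. auto-detection decision; all names are hypothetical. */
final class LangDecision {

    /** Hypothetical result of identify(): a language code plus a score in 0..1. */
    static final class Detected {
        final String lang;
        final double score;

        Detected(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /**
     * @param metaLang        value taken from the meta tags, or null if absent
     * @param detected        result of auto-detection, or null if none
     * @param contentLength   length of the parsed text
     * @param lengthThreshold stands for the "lang.analyze.max.length" setting
     * @param certainty       score threshold (0.8 is the tentative value)
     * @return the language value to store, or null if neither source has one
     */
    static String choose(String metaLang, Detected detected,
                         int contentLength, int lengthThreshold, double certainty) {
        // No meta tag: take whatever detection produced (possibly nothing).
        if (metaLang == null) {
            return detected == null ? null : detected.lang;
        }
        // No detection result: fall back to the meta tag.
        if (detected == null || detected.lang == null) {
            return metaLang;
        }
        // Content too short for reliable detection: trust the meta tag.
        if (contentLength < lengthThreshold) {
            return metaLang;
        }
        // The two disagree and detection is confident enough: override.
        if (!metaLang.equalsIgnoreCase(detected.lang) && detected.score > certainty) {
            return detected.lang;
        }
        // Otherwise keep the meta tag value.
        return metaLang;
    }
}
```

With these rules, the Frontpage case ("en-us" in the meta tag, long Spanish content, high detection score) would end up as "es", while the same page with a low score or very short content would keep "en-us".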

Similar changes would be needed in LanguageIndexingFilter.filter(), to handle text coming from other content types.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


