> I have an issue with the language detection plugin, which I'm not sure
> how to address. The plugin first tries to extract the language
> identifier from meta tags. However, the meta tag values people put there
> are often completely wrong, or follow obscure pseudo-standards.
>
> Example: there is a bunch of pages, generated by Frontpage, where the
> author apparently forgot to change the default settings. So, the meta
> tags say "en-us", while the real content of the page is in Spanish. The
> identify() method shows this clearly.
> The final value put in X-meta-lang is "en-us". Now, the question is:
> should the plugin override that value with the one from the
> auto-detection? This means that it should always run the detection
> step... Can we have more confidence in our detection mechanism than in
> the author's knowledge? Well, perhaps, if for content longer than xxx
> bytes the detection is nearly unambiguous.

I think this is an issue for all detection mechanisms... For the
content-type it is the same problem: what is the right value, the one
provided by the protocol layer, the one provided by the extension
mapping, or the one provided by the detection (mime-magic)?

I think we need to change the actual process to use the auto-detection
mechanisms (this is true at least for the code that uses the language
identifier and the code that uses the mime-type identifier). Instead of
doing something like:

1. Get info from protocol
2. If no info from protocol, get info from parsing
3. If no info from parsing, get info from auto-detection

we need to do something like this (see the first sketch below):

1. Get info from protocol
2. Get info from parsing
3. Get degrees of confidence from auto-detection, and check:
   3.1 The value extracted from the protocol has a high degree of
       confidence: take the protocol value.
   3.2 The value extracted from parsing has a high degree of
       confidence: take the parsing value.
   3.3 Neither has a high degree of confidence, but the auto-detection
       returns another value with a high degree of confidence: take the
       auto-detection value.
   3.4 Otherwise, take a default value.

> Another example: for a bunch of pages in Swedish, I collected the
> following values of X-meta-lang:
>
> (SCHEME=ISO.639-1) sv
> (SCHEME=ISO639-1) sv
> (SCHEME=RFC1766) sv-FI
> (SCHEME=Z39.53) SWE
> EN_US, SV, EN, EN_UK
> English Swedish
> English, swedish
> English,Swedish
> Other (Svenska)
> SE
> SV
> SV charset=iso-8859-1
> SV-FI
> SV; charset=iso-8859-1
> SVE
> SW
> SWE
> SWEDISH
> Sv
> Sve
> Svenska
> Swedish
> Swedish, svenska
> en, sv
> se
> se, en
> se,en,de
> se-sv
> sv
> sv, be, dk, de, fr, no, pt, ch, fi, en
> sv, dk, fi, gl, is, fo
> sv, dk, no
> sv, en
> sv, eng
> sv, eng, de
> sv, fr, eng
> sv, nl
> sv, no, de
> sv, no, en, de, dk, fi
> sv,en
> sv,en,de,fr
> sv,eng
> sv,eng,de,fr
> sv,no,fi
> sv-FI
> sv-SE
> sv-en
> sv-fi
> sv-se
> sv; Content-Language: sv
> sv_SE
> sve
> svenska
> svenska, swedish, engelska, english, norsk, norwegian, polska, polish
> sw
> swe
> swe.SPR.
> sweden
> swedish
> swedish,
> text/html; charset=sv-SE
> text/html; sv
> torp, stuga, uthyres, bed & breakfast
>
> In all cases the value from the detection routine was unambiguous:
> swedish.

Yes, I recently saw this problem while analyzing my indexes... A first
step could be to improve the Content-Language / dc.language / html lang
parsers. It could be done in the HTMLLanguageParser (a normalization
sketch follows below).

> In this light, I propose the following changes:
>
> * modify the identify() method to return a pair of lang code + relative
>   score (normalized to 0..1)

What do you think about returning a sorted array of lang/score pairs
instead? (See the LanguageScore sketch below.)

> * in HTMLLanguageParser we should always run
>   LanguageIdentifier.identify(parse.getText())

Yes!
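To make the new process concrete, here is a minimal sketch of the
selection in steps 3.1 to 3.4. All names and the 0.9 threshold are mine
(hypothetical), not the actual plugin API; the confidences are assumed to
come from the auto-detection step, scoring each candidate value:

// Minimal sketch of the confidence-based selection described above.
// Class, method and threshold are illustrative, not actual plugin code.
public class DetectionResolver {

  // Arbitrary threshold above which a value is considered trustworthy.
  private static final double HIGH_CONFIDENCE = 0.9;

  public String resolve(String protocolValue, double protocolConfidence,
                        String parsedValue, double parsedConfidence,
                        String detectedValue, double detectedConfidence,
                        String defaultValue) {
    // 3.1 The detection confirms the protocol value with high confidence.
    if (protocolValue != null && protocolConfidence >= HIGH_CONFIDENCE) {
      return protocolValue;
    }
    // 3.2 The detection confirms the parsed value with high confidence.
    if (parsedValue != null && parsedConfidence >= HIGH_CONFIDENCE) {
      return parsedValue;
    }
    // 3.3 Neither is confirmed, but the detection itself is highly
    //     confident about another value.
    if (detectedValue != null && detectedConfidence >= HIGH_CONFIDENCE) {
      return detectedValue;
    }
    // 3.4 Nothing trustworthy: fall back to a default.
    return defaultValue;
  }
}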
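About improving the parsers: a first, purely illustrative idea is to
normalize the declared values before trusting them. The class below is
hypothetical, and the alias table is just a tiny sample built from the
Swedish list above, not a complete mapping:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a normalizer for declared language values.
public class LanguageCodeNormalizer {

  private static final Map<String, String> ALIASES =
      new HashMap<String, String>();
  static {
    ALIASES.put("swe", "sv");
    ALIASES.put("sve", "sv");
    ALIASES.put("sw", "sv");
    ALIASES.put("se", "sv");      // country code misused as language code
    ALIASES.put("svenska", "sv");
    ALIASES.put("swedish", "sv");
    ALIASES.put("sweden", "sv");
  }

  /** Reduces a raw declared value to a bare ISO 639-1 code, or null. */
  public static String normalize(String raw) {
    if (raw == null) {
      return null;
    }
    // Keep only the first entry of a comma/semicolon separated list and
    // drop charset suffixes and region subtags ("sv-FI" -> "sv").
    // Values like "(SCHEME=...) sv" would need extra handling, omitted here.
    String s = raw.trim().toLowerCase();
    s = s.split("[,;]")[0].trim();
    s = s.split("[\\s_\\-]")[0];
    if (s.length() == 0) {
      return null;
    }
    String alias = ALIASES.get(s);
    if (alias != null) {
      return alias;
    }
    // Keep only plausible two-letter codes; everything else is junk that
    // the auto-detection will have to resolve anyway.
    return s.matches("[a-z]{2}") ? s : null;
  }
}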
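And about the identify() return value, the sorted array I have in mind
could look roughly like this (hypothetical names again; LanguageScore is
not existing code):

// Hypothetical sketch: identify() would return a LanguageScore[] sorted
// best-first instead of a single code.
public class LanguageScore implements Comparable<LanguageScore> {

  public final String lang;  // language code, e.g. "sv"
  public final float score;  // relative score, normalized to 0..1

  public LanguageScore(String lang, float score) {
    this.lang = lang;
    this.score = score;
  }

  // Descending order, so Arrays.sort() puts the best guess first.
  public int compareTo(LanguageScore other) {
    return Float.compare(other.score, this.score);
  }
}

HTMLLanguageParser could then always call the identifier and keep the
whole ranked list for the arbitration step described earlier, along the
lines of (assumed usage, not actual code):

// Always run the identifier on the parsed text, whatever the tags say.
LanguageScore[] guesses = identifier.identify(parse.getText());
String detected = (guesses.length > 0) ? guesses[0].lang : null;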
For information, there are some other issues with the language
identifier: I was focused on performance and precision, and now that I
run it outside of the "lab" and perform some tests in real life, with
real documents, I see a very big issue: the LanguageIdentifierPlugin is
UTF-8 oriented!!!

I discovered this issue and analyzed it yesterday: with UTF-8 encoded
input documents you get some very fine identification, but with any
other encoding it is a disaster. Sami (I think you were the original and
first coder of the LanguageIdentifierPlugin), do you already know about
this problem? Do you have some ideas about solving it? Actually, it is a
very big issue, and the language identifier cannot be used on a real
crawl as it stands (a first, speculative idea is sketched at the end of
this mail).

Thanks Andrzej for your feedback and ideas. (I will continue to focus my
work on the encoding problem, but once I can commit, I will implement
the changes you suggest in this mail.)

In fact, there are still a lot of TODOs in the language identifier: the
more I work on it, the more issues I see to fix, but it is a very
important module if we want to add multi-lingual support to Nutch. So, I
will update the Wiki pages about the language identifier in order to
keep track of all these fixes/ideas/issues...
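For the record, my first (speculative, untested) idea for the encoding
problem: never let the identifier see raw bytes, and decode the content
with its declared or detected charset first, so that non-UTF-8 documents
arrive as proper Java characters instead of skewing the character
statistics the identifier relies on. A hedged sketch, with illustrative
names only:

import java.io.UnsupportedEncodingException;

// Illustrative helper, not actual plugin code: decode raw content with
// its declared charset before calling LanguageIdentifier.identify().
public class EncodingHelper {

  public static String toUnicode(byte[] rawContent, String declaredCharset)
      throws UnsupportedEncodingException {
    // Fall back to ISO-8859-1 (or a real charset detector) when the
    // document declares nothing.
    String charset =
        (declaredCharset != null) ? declaredCharset : "ISO-8859-1";
    return new String(rawContent, charset);
  }
}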
Best Regards

Jerome

--
http://motrech.free.fr/
http://www.frutch.org/