Re: lang identifier and nutch analyzer in trunk

Andrzej Bialecki Tue, 24 Jan 2006 03:11:39 -0800

Jérôme Charron wrote:

Is it reasonable to guess language info from the target server's
geographical info?


Yes, it could be another clue for guessing the language.
But the problem then is to work out how to use all these clues.

For instance, the current solution is the simplest one, but certainly not the
most effective one:
For HTML documents, the HTMLLanguageParser scans each document for
possible indications of content language:
1. html lang attribute
2. meta dc.language
3. meta http-equiv
The first one found is assumed to be the document's language.
Then if no language is found, the statistical language identifier is
used....
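
As a rough illustration of the lookup order described above, here is a
minimal Python sketch. It is not the actual HTMLLanguageParser code: the
names LangHintParser, detect_language and the statistical_identify callback
are hypothetical, and it assumes that "meta http-equiv" means the
Content-Language header.

```python
from html.parser import HTMLParser


class LangHintParser(HTMLParser):
    """Collects the three declared-language hints from an HTML document."""

    def __init__(self):
        super().__init__()
        self.html_lang = None    # 1. <html lang="...">
        self.dc_language = None  # 2. <meta name="dc.language" content="...">
        self.http_equiv = None   # 3. <meta http-equiv="Content-Language" ...>

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        a = {k: (v or "").strip() for k, v in attrs}
        if tag == "html" and self.html_lang is None:
            self.html_lang = a.get("lang") or None
        elif tag == "meta":
            content = a.get("content") or None
            if a.get("name", "").lower() == "dc.language":
                if self.dc_language is None:
                    self.dc_language = content
            elif a.get("http-equiv", "").lower() == "content-language":
                if self.http_equiv is None:
                    self.http_equiv = content


def detect_language(html, statistical_identify):
    """Return the first declared hint in priority order, else a statistical guess.

    statistical_identify is a hypothetical callback (text -> language code)
    standing in for the statistical language identifier.
    """
    parser = LangHintParser()
    parser.feed(html)
    for hint in (parser.html_lang, parser.dc_language, parser.http_equiv):
        if hint:
            return hint
    return statistical_identify(html)
```

Note that in this sketch the statistical identifier only runs when none of
the three hints is present, so a wrong tag value is never corrected.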

We're going back to the old discussion - most web pages out there either don't have these tags at all, or, even if they have them, they contain wrong values ... so I think this policy is not going to give the best results.

IMHO we should always try to guess the language if we have enough text, unless we can be sure that we are dealing with properly marked documents (not such an uncommon case in intranets).
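
To make the proposal concrete, a minimal Python sketch of this policy might look like the following. Everything here is an assumption for illustration: the function name choose_language, the trust_declared flag (for collections known to be properly marked), the statistical_identify callback, and especially the min_chars threshold, which would need tuning against real data.

```python
def choose_language(text, declared, statistical_identify,
                    trust_declared=False, min_chars=200):
    """Prefer a statistical guess over declared tags when there is enough text.

    text                 -- extracted plain text of the document
    declared             -- language taken from tags/headers, or None
    statistical_identify -- hypothetical callback: text -> language code
    trust_declared       -- True for collections known to be well marked
    min_chars            -- assumed threshold for "enough text"
    """
    # Properly marked documents: believe the declaration.
    if trust_declared and declared:
        return declared
    # Enough text: always guess, ignoring possibly wrong tags.
    if len(text.strip()) >= min_chars:
        return statistical_identify(text)
    # Too little text: fall back to the declaration, or guess anyway if
    # nothing was declared.
    return declared or statistical_identify(text)
```

With trust_declared left at False, a long page tagged "en" but written in Polish would be reported according to the statistical guess, not the tag.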


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
