Jérôme Charron wrote:
Is it reasonable to guess language information from the target server's
geographical information?
Yes, it could be another clue for guessing the language.
But the problem is then to work out how to combine all these clues.
For instance, the current solution is the easiest one, but certainly not the
most effective one:
For HTML documents, the HTMLLanguageParser scans the document for
possible indications of the content language:
1. html lang attribute
2. meta dc.language
3. meta http-equiv
The first one found is assumed to be the document's language.
If none is found, the statistical language identifier is used.
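
To make the lookup order above concrete, here is a minimal sketch in Python (not the actual HTMLLanguageParser code). All names are hypothetical, the statistical identifier is passed in as a plain callable, and treating "meta http-equiv" as http-equiv="content-language" is an assumption.

```python
from html.parser import HTMLParser


class LangHintParser(HTMLParser):
    """Collects the three declared-language hints listed above (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.dc_language = None
        self.http_equiv = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        a = {k: (v or "").strip() for k, v in attrs}
        if tag == "html" and a.get("lang"):
            self.html_lang = self.html_lang or a["lang"]
        elif tag == "meta" and a.get("content"):
            if a.get("name", "").lower() == "dc.language":
                self.dc_language = self.dc_language or a["content"]
            # Assumption: the http-equiv hint is the content-language header.
            elif a.get("http-equiv", "").lower() == "content-language":
                self.http_equiv = self.http_equiv or a["content"]


def declared_language(html):
    """Return the first declared hint in priority order, or None."""
    parser = LangHintParser()
    parser.feed(html)
    for value in (parser.html_lang, parser.dc_language, parser.http_equiv):
        if value:
            return value.lower()
    return None


def detect_language(html, statistical_identify):
    """Declared hint first; fall back to the statistical identifier."""
    return declared_language(html) or statistical_identify(html)
```

With this ordering a wrong lang attribute always wins over the text itself, which is the weakness of the current approach.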
We're going back to the old discussion: most web pages out there either
don't have these tags at all or, if they do, the tags contain wrong
values. So I think this policy is not going to give the best results.
IMHO we should always try to guess the language if we have enough text,
unless we can be sure that we are dealing with properly marked documents
(not such an uncommon case on intranets).
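
Roughly, the policy I have in mind could look like the sketch below (Python, purely illustrative). The function name, the trusted flag, and the 200-character threshold are assumptions, and so is falling back to the declared tag when there is too little text to guess from.

```python
def choose_language(declared, text, statistical_identify,
                    trusted=False, min_chars=200):
    """Prefer the statistical guess unless the source is known to be well marked.

    declared: language taken from the document's tags, or None
    text: extracted plain text of the document
    statistical_identify: callable returning a language code or None
    trusted: True for collections known to be properly marked (e.g. an intranet)
    min_chars: hypothetical threshold for "enough text"
    """
    if trusted and declared:
        return declared
    if len(text) >= min_chars:
        guess = statistical_identify(text)
        if guess:
            return guess
    # Assumption: with too little text, the declared tag is better than nothing.
    return declared
```

The threshold would have to be tuned against real pages; the value here is only a placeholder.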
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com