Hi Jerome!
Would it be possible to generate ngram profiles for the LanguageIdentifier
plugin from crawled content rather than from a file? My idea is this: the
best source of content in a single language could be wikipedia.org. We
would just crawl the Wikipedia in the desired language and then create an
ngram profile from it. What are your thoughts about this idea?
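
To make the idea concrete, here is a rough sketch of what building a profile from crawled text could look like. It is plain Java and does not use the plugin's own classes; the class and method names (NGramProfileSketch, countNGrams, topNGrams) are made up for illustration, and the n-gram lengths and the cut-off are assumptions, not the plugin's actual settings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: not the LanguageIdentifier plugin's real code.
public class NGramProfileSketch {

    /** Counts character n-grams of length minLen..maxLen in the given text. */
    static Map<String, Integer> countNGrams(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase, then collapse every run of non-letters into one '_' separator.
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_");
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to topK n-grams, most frequent first; ties are broken alphabetically. */
    static List<String> topNGrams(Map<String, Integer> counts, int topK) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from crawled pages in one language.
        String crawledText = "the cat and the hat and the bat";
        Map<String, Integer> counts = countNGrams(crawledText, 1, 3);
        System.out.println(topNGrams(counts, 5));
    }
}
```

In a real crawl the input would be the parsed text of all fetched pages for one language, and the ranked list would be written out in whatever format the plugin expects for its profiles.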
Best Regards,
Ivan
Jérôme Charron wrote:
What is a good strategy to adopt for multilingual sites?
I want nutch to index a site in its different languages, and
then have the search show only results that are in the user's language.
Hi Laurent,
What I can suggest is to:
1. use the languageidentifier plugin while crawling in order to guess the
language of the content
2. automatically filter the results by adding the lang:<user_agent_lang>
clause to the query (could be done in the jsp).
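
Step 2 could be sketched roughly as follows. This is plain Java with no Nutch or servlet classes; LangClauseSketch, primaryLanguage and withLangClause are hypothetical names. It assumes the user's language is taken from the browser's Accept-Language header and that the first entry of that header is the preferred one, which is usual but not guaranteed.

```java
import java.util.Locale;

// Hypothetical helper: appends a lang: clause to the user's query string.
public class LangClauseSketch {

    /**
     * Extracts the primary language code from an Accept-Language value such as
     * "fr-FR,fr;q=0.9,en;q=0.8". Returns null when no usable code is found.
     */
    static String primaryLanguage(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return null;
        }
        // Take the first entry, then drop any ";q=..." weight and "-REGION" suffix.
        String first = acceptLanguage.split(",", 2)[0];
        first = first.split(";", 2)[0].trim();
        String lang = first.split("-", 2)[0].toLowerCase(Locale.ROOT);
        // Accept only plain 2- or 3-letter codes; this also rejects the "*" wildcard.
        return lang.matches("[a-z]{2,3}") ? lang : null;
    }

    /** Returns the query with " lang:<code>" appended, or unchanged if no language is known. */
    static String withLangClause(String query, String acceptLanguage) {
        String lang = primaryLanguage(acceptLanguage);
        return lang == null ? query : query + " lang:" + lang;
    }

    public static void main(String[] args) {
        System.out.println(withLangClause("nutch", "fr-FR,fr;q=0.9,en;q=0.8"));
    }
}
```

In the jsp, the header value would come from the incoming request, and the rewritten string would be passed on as the query instead of the user's raw input.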
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/