Hi Jerome!

Would it be possible to generate n-gram profiles for the LanguageIdentifier plugin from crawled content rather than from a file? My idea is this: the best source of content in a single language could be wikipedia.org. We would just crawl Wikipedia in the desired language and then create an n-gram profile from the crawled text. What are your thoughts on this?
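
To make the idea a bit more concrete, here is a rough sketch in Python of what "create a profile from crawled text" could look like. It is only a generic character n-gram frequency count, not the plugin's actual profile format or code. The function name ngram_profile, the n-gram lengths (1 to 4), the cut-off of 300 entries, and the underscore used as a word-boundary marker are all my own assumptions for illustration. It also assumes the crawled pages have already been reduced to plain text.

```python
import re
from collections import Counter


def ngram_profile(text, min_n=1, max_n=4, top_k=300):
    """Return the top_k most frequent character n-grams with relative frequencies.

    Hypothetical helper for illustration; not the LanguageIdentifier plugin's API.
    """
    counts = Counter()
    # Lowercase and keep only runs of letters, so digits and punctuation are ignored.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"  # underscores mark the start and end of a word
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of boundary markers
                    counts[gram] += 1
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so no division by zero occurs.
    return [(gram, count / total) for gram, count in counts.most_common(top_k)]
```

Calling ngram_profile on the concatenated text of all crawled pages for one language would give one ranked list per language, which could then be written out in whatever format the plugin expects.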

Best Regards,
Ivan



Jérôme Charron wrote:

What is a good strategy to adopt for multilingual sites?

I want Nutch to index a site in its different languages, so that
the search then only shows results that are in the user's language.

Hi Laurent,

What I can suggest is to:
1. use the languageidentifier plugin while crawling in order to guess the
language of the content
2. automatically filter the results by adding the lang:<user_agent_lang>
clause to the query (this could be done in the JSP).
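
As a rough illustration of step 2, the logic could look like the sketch below. It is written in Python only to show the idea; in Nutch it would live in the JSP, in Java. It assumes the user's language is taken from the browser's Accept-Language header, and the function names primary_language and add_language_filter are hypothetical, not part of Nutch.

```python
def primary_language(accept_language, default="en"):
    """Pick the highest-priority language code from an Accept-Language header.

    Hypothetical helper; falls back to `default` when nothing usable is found.
    """
    best_code, best_q = default, -1.0
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0  # a tag without an explicit q value has full priority
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue  # ignore entries with a malformed q value
        if q > best_q:
            # Keep only the primary subtag, e.g. "fr" from "fr-FR".
            best_code, best_q = tag.split("-")[0], q
    return best_code


def add_language_filter(query, accept_language):
    """Append a lang: clause for the user's preferred language to the query."""
    return f"{query} lang:{primary_language(accept_language)}"
```

For example, add_language_filter("nutch plugin", "fr-FR,fr;q=0.9,en;q=0.8") gives the query string "nutch plugin lang:fr".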

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
