[Nutch-general] Re: nutch and multilingualism

Ivan Sekulovic Mon, 06 Mar 2006 00:27:01 -0800

Hi Jerome!

Would it be possible to generate the ngram profiles for the LanguageIdentifier plugin from crawled content rather than from a file? My idea is this: the best source of content in a single language could be wikipedia.org. We would just crawl Wikipedia in the desired language and then create the ngram profile from it. What are your thoughts on this idea?
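
To make the idea a bit more concrete, here is a rough sketch of how an ngram frequency table could be built from the text of crawled pages. This is only an illustration under my own assumptions: the class NGramSketch and its methods are hypothetical and are not the plugin's actual profile code, the "_" word padding and the 1 to 4 character range are choices made for the example, and the page text is assumed to be already extracted as plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the LanguageIdentifier plugin's own code.
final class NGramSketch {
    private NGramSketch() {}

    /** Counts character ngrams of length minLen..maxLen over all words of all pages. */
    static Map<String, Integer> count(Iterable<String> pages, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String page : pages) {
            // Split on anything that is not a letter, so digits and punctuation are dropped.
            for (String word : page.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Pad each word so that word-initial and word-final ngrams are distinguishable.
                String padded = "_" + word + "_";
                for (int n = minLen; n <= maxLen; n++) {
                    for (int i = 0; i + n <= padded.length(); i++) {
                        counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    /** Returns the most frequent ngrams, ties broken alphabetically for a stable order. */
    static List<String> topNGrams(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Calling NGramSketch.count(pages, 1, 4) on the text of the crawled Wikipedia pages and keeping the top few hundred entries would give a ranked ngram list for that language. How such a list would have to be written out so that the plugin can load it is not covered here.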


Best Regards,
Ivan



Jérôme Charron wrote:

What is a good strategy to adopt for multilingual sites?

I want nutch to index a site in its different languages, and then have the search show only the results that are in the user's language.


Hi Laurent,

What I can suggest is to:
1. use the languageidentifier plugin while crawling in order to guess the
language of the content
2. automatically filter the results by adding a lang:<user_agent_lang>
clause to the query (this could be done in the jsp).
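
As a minimal sketch of step 2, the helper below derives a language code from the browser's Accept-Language header and appends the lang: clause to the query string. The class LangQuery and its method names are hypothetical and not part of Nutch. The sketch also assumes that the browser lists its preferred language first, and that the jsp obtains the header value with request.getHeader("Accept-Language").

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class LangQuery {
    private LangQuery() {}

    /**
     * Returns the primary language code of the first entry in an
     * Accept-Language value, e.g. "fr-FR,fr;q=0.9,en;q=0.8" gives "fr".
     * Returns null when no usable code is found (null, empty or "*").
     */
    static String primaryLanguage(String acceptLanguage) {
        if (acceptLanguage == null) {
            return null;
        }
        // A limit of 2 keeps split() from returning an empty array on input like ",".
        String first = acceptLanguage.split(",", 2)[0];        // first listed range
        first = first.split(";", 2)[0].trim();                 // drop any ";q=..." weight
        first = first.split("-", 2)[0].toLowerCase(Locale.ROOT); // drop the region part
        return first.matches("[a-z]{2,3}") ? first : null;
    }

    /** Appends " lang:xx" unless no language is known or the query already has a lang: clause. */
    static String withLanguageFilter(String query, String acceptLanguage) {
        String lang = primaryLanguage(acceptLanguage);
        if (lang == null || query.contains("lang:")) {
            return query;
        }
        return query.trim() + " lang:" + lang;
    }
}
```

In the jsp this would be called on the user's query string just before it is parsed. A query that already contains a lang: clause is left alone, so a user can still search in another language explicitly.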

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
