> Would it be possible to generate n-gram profiles for the LanguageIdentifier
> plugin from crawled content rather than from a file? Here is my idea: the
> best source of content in a single language could be wikipedia.org. We would
> simply crawl Wikipedia in the desired language and then create an n-gram
> profile from it. What are your thoughts on this idea?
I think it could be a good idea. Wikipedia could be a good source (though I'm not sure it's the best one). But instead of crawling Wikipedia, it would probably be easier to download a Wikipedia dump (http://download.wikimedia.org/) and then extract its textual content to a file... no?

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
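For the dump route, once the textual content has been extracted to a file, building a character n-gram profile from it is straightforward. Here is a minimal sketch in Python (not the plugin's actual code; the function name, the word-boundary padding, and the top-k cutoff are illustrative assumptions, in the spirit of frequency-ranked n-gram profiles commonly used for language identification):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=1000):
    """Build a frequency-ranked character n-gram profile (1..n_max grams).

    Words are padded with '_' to capture word-boundary n-grams; the
    profile is the top_k most frequent n-grams, most frequent first.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

# Tiny demo; in practice `text` would be the extracted dump content.
profile = ngram_profile("the quick brown fox jumps over the lazy dog")
```

Two such profiles (one per language, built from each language's dump) can then be compared against the profile of an unknown document, e.g. by rank distance, to guess its language.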
