Jérôme Charron wrote:
Would it be possible to generate ngram profiles for the LanguageIdentifier
plugin from crawled content rather than from a file? Here is my idea: the best
source of content in a single language could be wikipedia.org. We would
just crawl the Wikipedia in the desired language and then create an ngram
profile from the crawled text. What are your thoughts on this idea?
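To make the idea concrete, profile creation basically boils down to counting character ngrams and keeping the most frequent ones. Here is a minimal sketch of that step; this is not the plugin's actual NGramProfile code, and the class name, the trigram size and the 1000-entry cutoff are just illustrative:

import java.io.*;
import java.util.*;

// Minimal sketch: count character trigrams in a text file and write the
// most frequent ones out, one "ngram count" pair per line.
public class NGramProfileSketch {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: NGramProfileSketch <text-file> <profile-file>");
      return;
    }
    final int n = 3;              // trigram size, illustrative
    final int maxEntries = 1000;  // cutoff, illustrative
    Map<String, Integer> counts = new HashMap<String, Integer>();

    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      // Normalize: lower-case, and use '_' as a word boundary marker.
      String text = "_" + line.toLowerCase().replaceAll("\\s+", "_") + "_";
      for (int i = 0; i + n <= text.length(); i++) {
        String gram = text.substring(i, i + n);
        Integer c = counts.get(gram);
        counts.put(gram, c == null ? 1 : c.intValue() + 1);
      }
    }
    in.close();

    // Sort ngrams by descending frequency and keep only the top entries.
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a,
                         Map.Entry<String, Integer> b) {
        return b.getValue().intValue() - a.getValue().intValue();
      }
    });

    PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8"));
    for (int i = 0; i < entries.size() && i < maxEntries; i++) {
      out.println(entries.get(i).getKey() + " " + entries.get(i).getValue());
    }
    out.close();
  }
}

Running something like this over a large single-language text file gives a frequency-ordered list of ngrams, which is essentially what a language profile is.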
I think it could be a good idea.
Wikipedia could be a good source (though I'm not sure it's the best one).
But instead of crawling Wikipedia, it would probably be easier to download a
Wikipedia dump
(http://download.wikimedia.org/) and then extract its textual content to a
file... no?
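A first pass over a dump could look something like this; just a sketch, assuming the pages-articles XML format, and the wiki-markup stripping uses deliberately crude regexes (a real extractor would need many more rules):

import java.io.*;
import javax.xml.stream.*;

// Rough sketch: stream through a Wikipedia XML dump, grab the contents of
// each <text> element, strip the most common wiki markup, and append the
// result to one big plain-text file.
public class DumpTextExtractor {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: DumpTextExtractor <dump.xml> <out.txt>");
      return;
    }
    XMLStreamReader xml = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream(args[0]), "UTF-8");
    PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8"));

    while (xml.hasNext()) {
      if (xml.next() == XMLStreamConstants.START_ELEMENT
          && "text".equals(xml.getLocalName())) {
        out.println(stripMarkup(xml.getElementText()));
      }
    }
    out.close();
    xml.close();
  }

  // Very rough markup removal: templates, links, tags, quotes, headings.
  static String stripMarkup(String s) {
    s = s.replaceAll("\\{\\{[^}]*\\}\\}", " ");                  // {{templates}}
    s = s.replaceAll("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]", "$1");  // [[target|label]]
    s = s.replaceAll("\\[\\[([^\\]]*)\\]\\]", "$1");             // [[link]]
    s = s.replaceAll("<[^>]*>", " ");                            // html tags
    s = s.replaceAll("'{2,}", "");                               // ''italics'', '''bold'''
    s = s.replaceAll("={2,}", " ");                              // == headings ==
    return s;
  }
}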
I agree about Wikipedia. But since Nutch is a content-fetching tool, it
would be useful to have some kind of tool that turns fetched content into
ngram profiles. It seems natural. Maybe it would be possible to
add some sort of plain-text export of the indexed content...
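Just to sketch what that export might look like; this is a back-of-the-envelope sketch only. It assumes the index is a plain Lucene index readable with the Lucene API of that era (IndexReader.open and friends), and that the text is stored in a field named "content", which is an assumption: in a stock Nutch index the full text may not be stored at all, in which case the tool would have to read the parse text from the segments instead.

import java.io.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Back-of-the-envelope sketch: dump a stored text field from every
// non-deleted document in a Lucene index to a plain-text file.
public class IndexTextDumper {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: IndexTextDumper <index-dir> <out.txt>");
      return;
    }
    IndexReader reader = IndexReader.open(args[0]);
    PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8"));
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      Document doc = reader.document(i);
      String text = doc.get("content");  // field name is an assumption
      if (text != null) out.println(text);
    }
    out.close();
    reader.close();
  }
}

The output file could then be fed straight into a profile-building tool like the one sketched earlier in the thread.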
Sekula