> Would it be possible to generate n-gram profiles for the LanguageIdentifier
> plugin from crawled content rather than from a file? Here is my idea: the
> best source of content in a single language could be wikipedia.org. We would
> simply crawl Wikipedia in the desired language and then create an n-gram
> profile from it. What are your thoughts on this idea?
I think it could be a good idea. Wikipedia could be a good source (though I'm not sure it's the best one). But instead of crawling Wikipedia, it would probably be easier to download a Wikipedia dump (http://download.wikimedia.org/) and then extract its textual content to a file... no?

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
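For the dump route, once the textual content has been extracted to a file, building a character n-gram profile from it is straightforward. Here is a minimal sketch in Python (not the plugin's actual code; the function name, the word-boundary padding, and the top-k cutoff are illustrative assumptions, in the spirit of frequency-ranked n-gram profiles commonly used for language identification):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=1000):
    """Build a frequency-ranked character n-gram profile (1..n_max grams).

    Words are padded with '_' to capture word-boundary n-grams; the
    profile is the top_k most frequent n-grams, most frequent first.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

# Tiny demo; in practice `text` would be the extracted dump content.
profile = ngram_profile("the quick brown fox jumps over the lazy dog")
```

Two such profiles (one per language, built from each language's dump) can then be compared against the profile of an unknown document, e.g. by rank distance, to guess its language.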
