The Gutenberg project is a good place to find source texts that can be used to generate n-grams for a set of languages. Sami, do you use the whole set available at http://people.csail.mit.edu/people/koehn/publications/europarl/ , or just some parts of the text to build the profiles? (If I remember my previous work on n-grams correctly, just a few MB are enough to get a representative set of 3-grams.)
I used a relatively small subset, just a few MB, to build the profiles.
http://www.gutenberg.org/
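
For readers who want to try this, here is a rough sketch of building a 3-gram profile from the first few MB of a text file. It is not the code used for the Nutch profiles: the function names, the 2 MB cap, the profile size of 300 and the underscore padding at word boundaries are all assumptions chosen for illustration.

```python
from collections import Counter


def read_sample(path, max_bytes=2_000_000, encoding="utf-8"):
    """Read at most max_bytes from a corpus file (hypothetical helper).

    The 2 MB default is an assumption based on the "few MB" mentioned above.
    """
    with open(path, "rb") as f:
        # errors="ignore" drops a multi-byte character cut off at the limit.
        return f.read(max_bytes).decode(encoding, errors="ignore")


def build_profile(text, n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Words are lower-cased and padded with "_" so that word boundaries
    count as part of the n-grams (an assumed convention).
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]
```

Something like build_profile(read_sample("corpus.txt")) would then give one profile per language, with "corpus.txt" standing in for any Gutenberg or Europarl text file.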
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
