2011/1/13 Jörn Kottmann <[email protected]>:
> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>>
>> 2011/1/4 Olivier Grisel <[email protected]>:
>>>
>>> I plan to give more details in a blog post soon (tm).
>>
>> Here it is:
>>
>> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>>
>> It gives a bit more context and some additional results and clues for
>> improvements and potential new usages.
>>
> Now I read this post too, sounds very interesting.
>
> What is the biggest training file for the name finder you can generate with
> this method?
It depends on the entity class you are interested in and the language of the
dump. For instance, for the pair (person / French) I have more than 600k
sentences; for English it is going to be much bigger. For entity classes such
as "Drug" or "Protein" it is much lower (I would say a couple of thousand
sentences).

I trained my French models on my laptop with limited memory (2GB allocated to
the heap space), hence I stopped at ~100k sentences in the training file to
avoid GC thrashing. On Amazon EC2 instances with more than 10GB of RAM I guess
you could train a model on 500k sentences and test it on the remaining 100k
sentences, for instance (see the training sketch appended below). At such
scales, averaged perceptron learners or SGD-based logistic regression models
as implemented in Apache Mahout would probably be faster to train than the
current MaxEnt implementation.

> I think we need MapReduce training support for OpenNLP. Actually that is
> already on my todo list, but currently I am still busy with the Apache
> migration and the next release.

Alright, no hurry. Please ping me as soon as you are ready to discuss this.

> Anyway I hope we can get that done at least partially for the name finder
> this year.

Great :)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
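P.S. For reference, here is a minimal sketch of what training such a name
finder model looks like with the OpenNLP 1.5-era Java API. The file names, the
"fr" language code and the "person" entity type are placeholders, and the
iteration/cutoff values are just typical settings; the training file is assumed
to be in the standard name finder format (one sentence per line with
<START:person> ... <END> markup):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainPersonNameFinder {

  public static void main(String[] args) throws Exception {
    Charset charset = Charset.forName("UTF-8");

    // Placeholder training file: one sentence per line, entities marked
    // with <START:person> ... <END>
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("fr-ner-person.train"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    TokenNameFinderModel model;
    try {
      // 100 iterations and a feature cutoff of 5 are typical values,
      // not tuned for this data set
      model = NameFinderME.train("fr", "person", sampleStream,
          Collections.<String, Object>emptyMap(), 100, 5);
    } finally {
      sampleStream.close();
    }

    BufferedOutputStream modelOut =
        new BufferedOutputStream(new FileOutputStream("fr-ner-person.bin"));
    try {
      model.serialize(modelOut);
    } finally {
      modelOut.close();
    }
  }
}

The memory limits mentioned above apply to the JVM running this trainer, so
on a bigger EC2 instance you would simply raise the heap with -Xmx before
feeding it the larger training files.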
