2011/1/13 Jörn Kottmann <[email protected]>:
> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>>
>> 2011/1/4 Olivier Grisel <[email protected]>:
>>>
>>> I plan to give more details in a blog post soon (tm).
>>
>> Here it is:
>>
>> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>>
>> It gives a bit more context and some additional results and clues for
>> improvements and potential new usages.
>>
> Now I read this post too, sounds very interesting.
>
> What is the biggest training file for the name finder you can generate with
> this method?
It depends on the entity class you are interested in and the language of the
dump. For instance, for the pair (person / French) I have more than 600k
sentences; for English it is going to be much bigger. For entity classes such
as "Drug" or "Protein" it is much lower (I would say a couple of thousand
sentences).

I trained my French models on my laptop with limited memory (2GB allocated to
the heap space), hence I stopped at ~100k sentences in the training file to
avoid GC thrashing. On Amazon EC2 instances with more than 10GB of RAM I guess
you could train a model on 500k sentences and test it on the remaining 100k
sentences, for instance (see the training sketch appended below). At such
scales, averaged perceptron learners or SGD-based logistic regression models
as implemented in Apache Mahout would probably be faster to train than the
current MaxEnt implementation.

> I think we need MapReduce training support for OpenNLP. Actually that is
> already on my todo list, but currently I am still busy with the Apache
> migration and the next release.

Alright, no hurry. Please ping me as soon as you are ready to discuss this.

> Anyway I hope we can get that done at least partially for the name finder
> this year.

Great :)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
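P.S. For reference, here is a minimal sketch of what training such a name
finder model looks like with the OpenNLP 1.5-era Java API. The file names, the
"fr" language code and the "person" entity type are placeholders, and the
iteration/cutoff values are just typical settings; the training file is assumed
to be in the standard name finder format (one sentence per line with
<START:person> ... <END> markup):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainPersonNameFinder {

  public static void main(String[] args) throws Exception {
    Charset charset = Charset.forName("UTF-8");

    // Placeholder training file: one sentence per line, entities marked
    // with <START:person> ... <END>
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("fr-ner-person.train"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

    TokenNameFinderModel model;
    try {
      // 100 iterations and a feature cutoff of 5 are typical values,
      // not tuned for this data set
      model = NameFinderME.train("fr", "person", sampleStream,
          Collections.<String, Object>emptyMap(), 100, 5);
    } finally {
      sampleStream.close();
    }

    BufferedOutputStream modelOut =
        new BufferedOutputStream(new FileOutputStream("fr-ner-person.bin"));
    try {
      model.serialize(modelOut);
    } finally {
      modelOut.close();
    }
  }
}

The memory limits mentioned above apply to the JVM running this trainer, so
on a bigger EC2 instance you would simply raise the heap with -Xmx before
feeding it the larger training files.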
