Thanks, I would love to help. I am just a practitioner, though, not an NLP
expert.

C
On Apr 27, 2011, at 2:05 PM, Jörn Kottmann wrote:

> On 4/27/11 9:04 PM, Chris Collins wrote:
>> 1) I understand you cannot distribute the original English training set, 
>> perhaps because of distribution rights. Still, knowing where, or at least 
>> roughly where, the original corpus came from would be nice. Knowing what 
>> kind of people labeled the data, how many of them there were, and how much 
>> data they labeled would help in determining whether we are off.
>> 
> This is actually on my to-do list. We need to create a wiki page or 
> similar to document the training data the English models have been 
> trained on. All the other models are mostly trained on public data.
> 
>> 2) What models are planned? Are there any existing open source projects 
>> that want help with these efforts?
>> 
> There are no plans from my side. If you know of a public corpus you would 
> like to train OpenNLP on, we are happy to add native support for it, as 
> we already have for a couple of corpora.
> 
>> 3) I see that 1.5 has better support for reading training sets in other 
>> file formats. What is the motivation? Is it so that OpenNLP can take 
>> advantage of existing training sets, which would help with 2), or is it 
>> generally to help the community interoperate better?
> 
> From my side, the main motivation was to have data sets people can test 
> OpenNLP on; if someone wants to contribute something, they can now at 
> least test their modification. Another motivation is that the more 
> languages and corpora we support, the more people are interested in 
> working on and with OpenNLP.
> 
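> To give a concrete picture: a format reader just produces the same sample 
> stream the trainers already consume, so a converter from 
> opennlp.tools.formats can be dropped in where a native-format reader would 
> go. A rough, untested sketch against the 1.5 API (file names are made up):
> 
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.InputStreamReader;
> import java.nio.charset.Charset;
> import java.util.Collections;
> 
> import opennlp.tools.namefind.NameFinderME;
> import opennlp.tools.namefind.NameSample;
> import opennlp.tools.namefind.NameSampleDataStream;
> import opennlp.tools.namefind.TokenNameFinderModel;
> import opennlp.tools.util.ObjectStream;
> import opennlp.tools.util.PlainTextByLineStream;
> 
> public class TrainNameFinder {
>     public static void main(String[] args) throws Exception {
>         // Training data in OpenNLP's native name finder format; a
>         // converter stream from opennlp.tools.formats could be used
>         // here instead to read another corpus format.
>         ObjectStream<String> lines = new PlainTextByLineStream(
>                 new InputStreamReader(
>                         new FileInputStream("en-ner-person.train"),
>                         Charset.forName("UTF-8")));
>         ObjectStream<NameSample> samples = new NameSampleDataStream(lines);
> 
>         // Train a maxent name finder model from the sample stream
>         // and write it out.
>         TokenNameFinderModel model = NameFinderME.train(
>                 "en", "person", samples,
>                 Collections.<String, Object>emptyMap());
>         samples.close();
>         model.serialize(new FileOutputStream("en-ner-person.bin"));
>     }
> }
> 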
> BTW, we had a discussion here about starting a corpus project based on 
> Wikinews (and also Wikipedia) content; maybe you would be interested in 
> helping with that.
> 
> Jörn