Thanks, I would love to help. I am just a practitioner, though, not an NLP expert.
C

On Apr 27, 2011, at 2:05 PM, Jörn Kottmann wrote:

> On 4/27/11 9:04 PM, Chris Collins wrote:
>> 1) I can understand you cannot distribute the original training set for
>> English etc. because of distribution rights. Knowing where, or at least
>> the flavor of where, the original corpus came from would be nice. What
>> type of people, and how many, were used in labeling the data? That would
>> be useful in determining if we are off.
>>
> This is actually on my to-do list. We need to create a wiki page or so to
> document the training data the English models have been trained on. All
> the other models are mostly trained on public data.
>
>> 2) What are the planned models? Are there any existing open source
>> projects that want help on these exercises?
>>
> There are no plans from my side. If you know of a public corpus you would
> like to train OpenNLP on, we are happy to add native support for it, like
> we did for a couple of corpora already.
>
>> 3) I see that with 1.5 there seems to be better support for taking
>> training sets from other file formats. What are the motivations? Is it so
>> that OpenNLP can take advantage of existing training sets that will help
>> with 2), or is it generally to help the community interoperate better?
>
> From my side the main motivation was to have data sets people can test
> OpenNLP on; if someone wants to contribute something, they can now at
> least test the modification.
> Another motivation is that the more languages and corpora we support, the
> more people are interested in working on and with OpenNLP.
>
> BTW, we had a discussion here about starting a wikinews (and also
> wikipedia) content-based corpus project; maybe you would be interested in
> helping with that.
>
> Jörn
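
For concreteness, plugging one of those already-supported corpora into
training looks roughly like this with the 1.5-era API. This is a minimal
sketch, assuming the Conll02NameSampleStream adapter from the
opennlp.tools.formats package and the deprecated static NameFinderME.train
overload; the file paths are hypothetical and the exact signatures varied
between releases, so treat it as illustrative rather than definitive:

import java.io.FileInputStream;
import java.io.FileOutputStream;

import opennlp.tools.formats.Conll02NameSampleStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;

public class Conll02TrainingSketch {
    public static void main(String[] args) throws Exception {
        // Adapt the corpus-specific file format into OpenNLP's own
        // NameSample objects; "esp.train" is a hypothetical path to the
        // Spanish CoNLL-2002 training file.
        ObjectStream<NameSample> samples = new Conll02NameSampleStream(
                Conll02NameSampleStream.LANGUAGE.ES,
                new FileInputStream("esp.train"),
                Conll02NameSampleStream.GENERATE_PERSON_ENTITIES);

        // Train a maxent person-name finder; 100 iterations with a
        // feature cutoff of 5 were the customary defaults.
        TokenNameFinderModel model = NameFinderME.train(
                "es", "person", samples, null, null, 100, 5);

        // Write the trained model out in OpenNLP's package format.
        model.serialize(new FileOutputStream("es-ner-person.bin"));
    }
}

The point of the formats package is exactly what Jörn describes: once a
stream adapter for a corpus exists, the rest of the training pipeline is
unchanged, so adding support for a new public corpus mostly means writing
one such ObjectStream implementation.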
