Jörn is absolutely right about that. Another good source of training data is MASC. I've got some instructions for training models with MASC here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality, so the instructions there should make it fairly straightforward to adapt MASC data to OpenNLP.

-Jason

On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <[email protected]> wrote:

> There are some research papers which study and compare the performance of NLP toolkits, but be careful: they often don't train the NLP tools on the same data, and the training data makes a big difference to the performance.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <[email protected]> wrote:
>
> > Just don't use the very old existing models; to get good results you have to train on your own data, especially if the domain of the data used for training and the data which should be processed doesn't match. The old models are trained on 90s news; those don't work well on today's news and probably much worse on tweets.
> >
> > OntoNotes is a good place to start if the goal is to process news. OpenNLP comes with built-in support to train models from OntoNotes.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> >
> > > This sounds like a fantastic idea.
> > >
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: [email protected]
> > > WWW: http://sunset.usc.edu/~mattmann/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Director, Information Retrieval and Data Science Group (IRDS)
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > WWW: http://irds.usc.edu/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >
> > > On 6/21/16, 12:13 AM, "Anthony Beylerian" <[email protected]> wrote:
> > >
> > > > +1
> > > >
> > > > Maybe we could put the results of the evaluator tests for each component somewhere on a webpage and update them on every release. This is of course provided there are reasonable data sets for testing each component. What do you think?
> > > >
> > > > Anthony
> > > >
> > > > > From: [email protected]
> > > > > Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > > > Subject: Re: Performances of OpenNLP tools
> > > > > To: [email protected]
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thank you for your replies.
> > > > >
> > > > > Please, Jeffrey, accept once more my apologies for your receiving the email twice.
> > > > >
> > > > > I also think it would be great to have such studies on the performance of OpenNLP.
> > > > >
> > > > > I have been looking for this information and checked in many places, including obviously Google Scholar, and I haven't found any serious studies or reliable results. Most of the existing ones report the performance of outdated releases of OpenNLP, and focus more on the execution time or CPU/RAM consumption, etc.
> > > > >
> > > > > I think such a comparison will help not only evaluate the overall accuracy, but also highlight the issues with the existing models (as a matter of fact, the existing models fail to recognize many of the hashtags in tweets: the tokenizer splits them into the "#" symbol and a word that the PoS tagger also fails to recognize).
> > > > >
> > > > > Therefore, building Twitter-based models would also be useful, since much of the work in academia and industry is focusing on Twitter data.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Mondher
> > > > >
> > > > > On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <[email protected]> wrote:
> > > > >
> > > > > > It would be fantastic to have these numbers. This is an example of something that would be a great contribution by someone trying to contribute to open source and who is maybe just getting into machine learning and natural language processing.
> > > > > >
> > > > > > For Twitter-ish text, it'd be great to look at models trained and evaluated on the Tweet NLP resources:
> > > > > >
> > > > > > http://www.cs.cmu.edu/~ark/TweetNLP/
> > > > > >
> > > > > > And comparing to how their models performed, etc. Also, it's worth looking at spaCy (Python NLP modules) for further comparisons.
> > > > > >
> > > > > > https://spacy.io/
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <[email protected]> wrote:
> > > > > >
> > > > > > > I saw the same question on the users list on June 17. At least I thought it was the same question -- sorry if it wasn't.
> > > > > > >
> > > > > > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
> > > > > > >
> > > > > > > > Well, hold on. He sent that mail (as of the time of this mail) 4 mins previously. Maybe some folks need some time to reply ^_^
> > > > > > > >
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > Chris Mattmann, Ph.D.
> > > > > > > > Chief Architect
> > > > > > > > Instrument Software and Science Data Systems Section (398)
> > > > > > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > > > > > Office: 168-519, Mailstop: 168-527
> > > > > > > > Email: [email protected]
> > > > > > > > WWW: http://sunset.usc.edu/~mattmann/
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > Director, Information Retrieval and Data Science Group (IRDS)
> > > > > > > > Adjunct Associate Professor, Computer Science Department
> > > > > > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > > > > > WWW: http://irds.usc.edu/
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >
> > > > > > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Mondher,
> > > > > > > > >
> > > > > > > > > Since you didn't get any replies I'm guessing no one is aware of any resources related to what you need. Google Scholar is a good place to look for papers referencing OpenNLP and its methods (in case you haven't searched it already).
> > > > > > > > >
> > > > > > > > > Jeff
> > > > > > > > >
> > > > > > > > > On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Apologies if you received multiple copies of this email. I sent it to the users list a while ago, and haven't had an answer yet.
> > > > > > > > > >
> > > > > > > > > > I have been looking for a while for any relevant work that performed tests on the OpenNLP tools (in particular the Lemmatizer, Tokenizer and PoS-Tagger) when used with short and noisy texts such as Twitter data, etc., and/or compared them to other libraries.
> > > > > > > > > >
> > > > > > > > > > By performance, I mean accuracy/precision, rather than time of execution, etc.
> > > > > > > > > >
> > > > > > > > > > If anyone can refer me to a paper or a work done in this context, that would be of great help.
> > > > > > > > > >
> > > > > > > > > > Thank you very much.
> > > > > > > > > >
> > > > > > > > > > Mondher
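
To make the "train on your own data" advice above concrete, here is a rough, untested sketch of what training and then evaluating a POS tagger could look like with the OpenNLP Java API (1.6-era classes). The file names en-pos-train.txt, en-pos-eval.txt and en-pos-custom.bin are placeholders, and the data is assumed to be in OpenNLP's word_tag format, one sentence per line.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainAndEvalPosTagger {

        public static void main(String[] args) throws Exception {
            // Training data in word_tag format, e.g. "The_DT quick_JJ fox_NN ..."
            // (hypothetical file name).
            ObjectStream<POSSample> trainSamples = new WordTagSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("en-pos-train.txt")),
                            StandardCharsets.UTF_8));

            // Train a model with the default training parameters.
            POSModel model = POSTaggerME.train("en", trainSamples,
                    TrainingParameters.defaultParams(), new POSTaggerFactory());

            // Persist the model so it can be reused later.
            try (OutputStream out = new FileOutputStream("en-pos-custom.bin")) {
                model.serialize(out);
            }

            // Evaluate on a held-out file in the same format and report token accuracy.
            ObjectStream<POSSample> evalSamples = new WordTagSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("en-pos-eval.txt")),
                            StandardCharsets.UTF_8));
            POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
            evaluator.evaluate(evalSamples);
            System.out.println("Word accuracy: " + evaluator.getWordAccuracy());
        }
    }

The bundled command-line tools (e.g. opennlp POSTaggerTrainer and opennlp POSTaggerEvaluator) cover the same ground, and are probably what a per-release results page, as Anthony suggests, would end up scripting.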

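And to reproduce the hashtag problem Mondher describes, a minimal sketch using the pre-trained newswire models (en-token.bin and en-pos-maxent.bin from the model download page; the tweet text and paths are made up):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TweetTaggingDemo {

        public static void main(String[] args) throws Exception {
            // Pre-trained models distributed for OpenNLP; paths are placeholders.
            try (InputStream tokIn = new FileInputStream("en-token.bin");
                 InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

                String tweet = "Loving the new release! #opennlp #nlp @ApacheOpenNLP";

                // With the old newswire models the hashtag is typically split into
                // "#" plus a word, and the word then gets an unreliable tag.
                String[] tokens = tokenizer.tokenize(tweet);
                String[] tags = tagger.tag(tokens);

                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "\t" + tags[i]);
                }
            }
        }
    }

Running something like this over a small annotated tweet sample, and comparing against the TweetNLP models Jason mentions, would already give a first set of numbers for the tokenizer and tagger.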