This sounds like a fantastic idea.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 6/21/16, 12:13 AM, "Anthony Beylerian" <anthonybeyler...@hotmail.com> wrote:

>+1
>
>Maybe we could put the results of the evaluator tests for each component
>somewhere on a webpage and on every release update them.
>This is of course provided there are reasonable data sets for testing each
>component.
>What do you think?
>
>Anthony
>
>> From: mondher.bouaz...@gmail.com
>> Date: Tue, 21 Jun 2016 15:59:47 +0900
>> Subject: Re: Performances of OpenNLP tools
>> To: dev@opennlp.apache.org
>>
>> Hi,
>>
>> Thank you for your replies.
>>
>> Please Jeffrey accept once more my apologies for receiving the email twice.
>>
>> I also think it would be great to have such studies on the performances of
>> OpenNLP.
>>
>> I have been looking for this information and checked in many places,
>> including obviously google scholar, and I haven't found any serious studies
>> or reliable results. Most of the existing ones report the performances of
>> outdated releases of OpenNLP, and focus more on the execution time or
>> CPU/RAM consumption, etc.
>>
>> I think such a comparison will help not only evaluate the overall accuracy,
>> but also highlight the issues with the existing models (as a matter of
>> fact, the existing models fail to recognize many of the hashtags in tweets:
>> the tokenizer splits them into the "#" symbol and a word that the PoS
>> tagger also fails to recognize).
>>
>> Therefore, building Twitter-based models would also be useful, since many
>> of the works in academia / industry are focusing on Twitter data.
>>
>> Best regards,
>>
>> Mondher
>>
>> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <jasonbaldri...@gmail.com>
>> wrote:
>>
>> > It would be fantastic to have these numbers. This is an example of
>> > something that would be a great contribution by someone trying to
>> > contribute to open source and who is maybe just getting into machine
>> > learning and natural language processing.
>> >
>> > For Twitter-ish text, it'd be great to look at models trained and evaluated
>> > on the Tweet NLP resources:
>> >
>> > http://www.cs.cmu.edu/~ark/TweetNLP/
>> >
>> > And comparing to how their models performed, etc. Also, it's worth looking
>> > at spaCy (Python NLP modules) for further comparisons.
>> >
>> > https://spacy.io/
>> >
>> > -Jason
>> >
>> > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <jzemer...@apache.org>
>> > wrote:
>> >
>> > > I saw the same question on the users list on June 17. At least I thought it
>> > > was the same question -- sorry if it wasn't.
>> > >
>> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <
>> > > chris.a.mattm...@jpl.nasa.gov> wrote:
>> > >
>> > > > Well, hold on. He sent that mail (as of the time of this mail) 4
>> > > > mins previously. Maybe some folks need some time to reply ^_^
>> > > >
>> > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jzemer...@apache.org> wrote:
>> > > >
>> > > > >Hi Mondher,
>> > > > >
>> > > > >Since you didn't get any replies I'm guessing no one is aware of any
>> > > > >resources related to what you need.
>> > > > >Google Scholar is a good place to look
>> > > > >for papers referencing OpenNLP and its methods (in case you haven't
>> > > > >searched it already).
>> > > > >
>> > > > >Jeff
>> > > > >
>> > > > >On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <
>> > > > >mondher.bouaz...@gmail.com> wrote:
>> > > > >
>> > > > >> Hi,
>> > > > >>
>> > > > >> Apologies if you received multiple copies of this email. I sent it to the
>> > > > >> users list a while ago, and haven't had an answer yet.
>> > > > >>
>> > > > >> I have been looking for a while if there is any relevant work that
>> > > > >> performed tests on the OpenNLP tools (in particular the Lemmatizer,
>> > > > >> Tokenizer and PoS-Tagger) when used with short and noisy texts such as
>> > > > >> Twitter data, etc., and/or compared it to other libraries.
>> > > > >>
>> > > > >> By performances, I mean accuracy/precision, rather than time of execution,
>> > > > >> etc.
>> > > > >>
>> > > > >> If anyone can refer me to a paper or a work done in this context, that
>> > > > >> would be of great help.
>> > > > >>
>> > > > >> Thank you very much.
>> > > > >>
>> > > > >> Mondher