Jörn is absolutely right about that. Another good source of training data is MASC. I've got some instructions for training models with MASC here:
https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial

Chalk (now defunct) provided a Scala wrapper around OpenNLP functionality, so the instructions there should make it fairly straightforward to adapt MASC data to OpenNLP.

-Jason

On Tue, 21 Jun 2016 at 10:46 Joern Kottmann <[email protected]> wrote:

> There are some research papers which study and compare the performance of NLP toolkits, but be careful: they often don't train the NLP tools on the same data, and the training data makes a big difference to the performance.
>
> Jörn
>
> On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <[email protected]> wrote:
>
> > Just don't use the very old existing models; to get good results you have to train on your own data, especially if the domain of the data used for training and the data which should be processed doesn't match. The old models are trained on 90s news; those don't work well on today's news and probably much worse on tweets.
> >
> > OntoNotes is a good place to start if the goal is to process news. OpenNLP comes with built-in support to train models from OntoNotes.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
> >
> > > This sounds like a fantastic idea.
> > >
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Chris Mattmann, Ph.D.
> > > Chief Architect
> > > Instrument Software and Science Data Systems Section (398)
> > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > Office: 168-519, Mailstop: 168-527
> > > Email: [email protected]
> > > WWW: http://sunset.usc.edu/~mattmann/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > Director, Information Retrieval and Data Science Group (IRDS)
> > > Adjunct Associate Professor, Computer Science Department
> > > University of Southern California, Los Angeles, CA 90089 USA
> > > WWW: http://irds.usc.edu/
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >
> > > On 6/21/16, 12:13 AM, "Anthony Beylerian" <[email protected]> wrote:
> > >
> > > > +1
> > > >
> > > > Maybe we could put the results of the evaluator tests for each component somewhere on a webpage and update them on every release. This is of course provided there are reasonable data sets for testing each component. What do you think?
> > > >
> > > > Anthony
> > > >
> > > > > From: [email protected]
> > > > > Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > > > Subject: Re: Performances of OpenNLP tools
> > > > > To: [email protected]
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thank you for your replies.
> > > > >
> > > > > Please, Jeffrey, accept once more my apologies for your receiving the email twice.
> > > > >
> > > > > I also think it would be great to have such studies on the performance of OpenNLP.
> > > > >
> > > > > I have been looking for this information and checked in many places, including obviously Google Scholar, and I haven't found any serious studies or reliable results. Most of the existing ones report the performance of outdated releases of OpenNLP, and focus more on the execution time or CPU/RAM consumption, etc.
> > > > >
> > > > > I think such a comparison will help not only evaluate the overall accuracy, but also highlight the issues with the existing models (as a matter of fact, the existing models fail to recognize many of the hashtags in tweets: the tokenizer splits them into the "#" symbol and a word that the PoS tagger also fails to recognize).
> > > > >
> > > > > Therefore, building Twitter-based models would also be useful, since much of the work in academia and industry is focusing on Twitter data.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Mondher
> > > > >
> > > > > On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <[email protected]> wrote:
> > > > >
> > > > > > It would be fantastic to have these numbers. This is an example of something that would be a great contribution by someone trying to contribute to open source and who is maybe just getting into machine learning and natural language processing.
> > > > > >
> > > > > > For Twitter-ish text, it'd be great to look at models trained and evaluated on the Tweet NLP resources:
> > > > > >
> > > > > > http://www.cs.cmu.edu/~ark/TweetNLP/
> > > > > >
> > > > > > And comparing to how their models performed, etc. Also, it's worth looking at spaCy (Python NLP modules) for further comparisons.
> > > > > >
> > > > > > https://spacy.io/
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Mon, 20 Jun 2016 at 10:41 Jeffrey Zemerick <[email protected]> wrote:
> > > > > >
> > > > > > > I saw the same question on the users list on June 17. At least I thought it was the same question -- sorry if it wasn't.
> > > > > > >
> > > > > > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
> > > > > > >
> > > > > > > > Well, hold on. He sent that mail (as of the time of this mail) 4 mins previously. Maybe some folks need some time to reply ^_^
> > > > > > > >
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > Chris Mattmann, Ph.D.
> > > > > > > > Chief Architect
> > > > > > > > Instrument Software and Science Data Systems Section (398)
> > > > > > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > > > > > Office: 168-519, Mailstop: 168-527
> > > > > > > > Email: [email protected]
> > > > > > > > WWW: http://sunset.usc.edu/~mattmann/
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > Director, Information Retrieval and Data Science Group (IRDS)
> > > > > > > > Adjunct Associate Professor, Computer Science Department
> > > > > > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > > > > > WWW: http://irds.usc.edu/
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >
> > > > > > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Mondher,
> > > > > > > > >
> > > > > > > > > Since you didn't get any replies I'm guessing no one is aware of any resources related to what you need. Google Scholar is a good place to look for papers referencing OpenNLP and its methods (in case you haven't searched it already).
> > > > > > > > >
> > > > > > > > > Jeff
> > > > > > > > >
> > > > > > > > > On Mon, Jun 20, 2016 at 11:19 AM, Mondher Bouazizi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Apologies if you received multiple copies of this email. I sent it to the users list a while ago, and haven't had an answer yet.
> > > > > > > > > >
> > > > > > > > > > I have been looking for a while for any relevant work that performed tests on the OpenNLP tools (in particular the Lemmatizer, Tokenizer and PoS-Tagger) when used with short and noisy texts such as Twitter data, etc., and/or compared them to other libraries.
> > > > > > > > > >
> > > > > > > > > > By performance, I mean accuracy/precision, rather than time of execution, etc.
> > > > > > > > > >
> > > > > > > > > > If anyone can refer me to a paper or a work done in this context, that would be of great help.
> > > > > > > > > >
> > > > > > > > > > Thank you very much.
> > > > > > > > > >
> > > > > > > > > > Mondher
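
To make the "train on your own data" advice above concrete, here is a rough, untested sketch of what training and then evaluating a POS tagger could look like with the OpenNLP Java API (1.6-era classes). The file names en-pos-train.txt, en-pos-eval.txt and en-pos-custom.bin are placeholders, and the data is assumed to be in OpenNLP's word_tag format, one sentence per line.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.postag.POSEvaluator;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainAndEvalPosTagger {

        public static void main(String[] args) throws Exception {
            // Training data in word_tag format, e.g. "The_DT quick_JJ fox_NN ..."
            // (hypothetical file name).
            ObjectStream<POSSample> trainSamples = new WordTagSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("en-pos-train.txt")),
                            StandardCharsets.UTF_8));

            // Train a model with the default training parameters.
            POSModel model = POSTaggerME.train("en", trainSamples,
                    TrainingParameters.defaultParams(), new POSTaggerFactory());

            // Persist the model so it can be reused later.
            try (OutputStream out = new FileOutputStream("en-pos-custom.bin")) {
                model.serialize(out);
            }

            // Evaluate on a held-out file in the same format and report token accuracy.
            ObjectStream<POSSample> evalSamples = new WordTagSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("en-pos-eval.txt")),
                            StandardCharsets.UTF_8));
            POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
            evaluator.evaluate(evalSamples);
            System.out.println("Word accuracy: " + evaluator.getWordAccuracy());
        }
    }

The bundled command-line tools (e.g. opennlp POSTaggerTrainer and opennlp POSTaggerEvaluator) cover the same ground, and are probably what a per-release results page, as Anthony suggests, would end up scripting.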

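And to reproduce the hashtag problem Mondher describes, a minimal sketch using the pre-trained newswire models (en-token.bin and en-pos-maxent.bin from the model download page; the tweet text and paths are made up):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TweetTaggingDemo {

        public static void main(String[] args) throws Exception {
            // Pre-trained models distributed for OpenNLP; paths are placeholders.
            try (InputStream tokIn = new FileInputStream("en-token.bin");
                 InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

                String tweet = "Loving the new release! #opennlp #nlp @ApacheOpenNLP";

                // With the old newswire models the hashtag is typically split into
                // "#" plus a word, and the word then gets an unreliable tag.
                String[] tokens = tokenizer.tokenize(tweet);
                String[] tags = tagger.tag(tokens);

                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "\t" + tags[i]);
                }
            }
        }
    }

Running something like this over a small annotated tweet sample, and comparing against the TweetNLP models Jason mentions, would already give a first set of numbers for the tokenizer and tagger.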