Hello,


It is definitely true that OpenNLP has been around for a long time (more
than 10 years), but that doesn't mean it hasn't improved. In fact it has
changed a lot in that period.

The core strength of OpenNLP has always been that it is really easy to
use for performing one of the supported NLP tasks.

This was further improved with the 1.5 release, which added model
packages that ensure the components are always instantiated correctly
across different runtime environments.

The problem is that the system used to train a model and the system
used to run it can be quite different. Prior to 1.5 it was possible to
get that wrong, which resulted in hard-to-notice performance problems.
I suspect that is an issue many of the competing solutions still have
today.

An example is the use of String.toLowerCase(): its output depends on
the platform's default locale.
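A small self-contained illustration of that locale pitfall in plain Java (not OpenNLP code, just the standard library): the same input lowercased under the Turkish locale produces a different string than under Locale.ROOT, because Turkish maps 'I' to the dotless 'ı'.

```java
import java.util.Locale;

public class LocaleLowercase {
    public static void main(String[] args) {
        String s = "TITLE";
        // Depends on whatever the default locale of the machine is:
        System.out.println(s.toLowerCase());
        // Turkish locale: 'I' lowercases to dotless 'ı' (U+0131)
        System.out.println(s.toLowerCase(new Locale("tr", "TR"))); // tıtle
        // Locale.ROOT gives stable, platform-independent behavior
        System.out.println(s.toLowerCase(Locale.ROOT)); // title
    }
}
```

If training and runtime machines have different default locales, features derived this way silently differ, which is exactly the hard-to-notice degradation described above.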

One thing that had become a bit dated was the machine learning part of
OpenNLP. This was addressed by adding more algorithms (e.g. perceptron
and L-BFGS maxent). In addition, the machine learning part is now
pluggable and can easily be swapped for a different implementation for
testing or production use. The sandbox contains an experimental Mallet
integration which makes all the Mallet classifiers available; even CRFs
can be used.

On Fri, 2015-11-06 at 16:54 +0000, Mattmann, Chris A (3980) wrote:
> Hi Everyone,
> 
> Hope you're well! I'm new to the list, however I just wanted to
> state I'm really happy with what I've seen with OpenNLP so far.
> My team and I have built a Tika Parser [1] that uses OpenNLP's
> location NER model, along with a Lucene Geo Names Gazeeteer [2]
> to create a "GeoTopicParser". We are improving it day to day.
> OpenNLP has definitely come a long way from when I looked at it
> years ago in its nascence.

Do you use the old SourceForge location model?

> That said, I keep hearing from people I talk to in the NLP community
> that for example OpenNLP is "old", and that I should be looking
> at e.g., Stanford's NER, and NLTK, etc. Besides obvious license
> issues (NER is GPL as an example and I am only interested in ALv2
> or permissive license code), I don't have a great answer to whether
> or not OpenNLP is old or not active, or not as good, etc. Can
> devs on this list help me answer that question? I'd like to be
> able to tell these NLP people the next time I talk to them that
> no, in fact, OpenNLP isn't *old*, and it's active, and there are
> these X and Y lines of development, and here's where they are
> going, etc.

One of the big issues is that OpenNLP only works well if the model is
suitable for the user's use case. The pre-trained models usually are
not, and people tend to judge us by the performance of those old
models.

People who benchmark NLP software often put OpenNLP in a bad light by
comparing different solutions based on their pre-trained models. In my
opinion a fair comparison of statistical NLP software is only possible
if they all use exactly the same training data, as is done in the many
shared tasks; otherwise it is like comparing lap times of two cars
driving on two different race tracks.

I personally use OpenNLP a lot at work and it runs in many of our
customized NLP pipelines to analyze incoming texts. In those pipelines
it is often used in a highly customized setup and trained on the data
that is processed by that pipeline.

I think that must be its single biggest strength: it can be easily used
and customized, and almost everything can be done without much hassle.
All components have a custom factory that can be used to fundamentally
change how a component is put together, including plugging in custom
code at almost any place.

OpenNLP used to be maxent-only, but that changed quite some time ago:
perceptron was added years back, we just received a naive Bayes
contribution, and since 1.6.0 there is pluggable machine learning
support. Existing machine learning libraries such as Mallet can be
integrated and used by all components, e.g. training a Mallet CRF or
maxent based model is now possible (the Mallet integration can be found
in the sandbox).

Small things are also happening all the time. The name finder now has
support for word cluster dictionaries, which can make all the
difference between a badly and a very well performing model.

Other NLP libraries, such as the mentioned Stanford NER, are missing
features. OpenNLP comes with built-in evaluation support and can
benchmark itself on a dataset using cross validation. This includes
tools which output detailed information about recognition mistakes, and
it has built-in support for many different corpora, such as OntoNotes,
CoNLL or user-created brat corpora.
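To make the cross validation point concrete, here is a generic sketch of what a cross validator does, independent of OpenNLP's own tooling (the class and method names below are made up for illustration): the data is partitioned into k folds, and each fold is held out for evaluation exactly once while the rest is used for training.

```java
import java.util.ArrayList;
import java.util.List;

public class KFold {
    // Partition n sample indices into k train/test splits. Every sample
    // ends up in the test set of exactly one fold, so the whole dataset
    // is used for evaluation without ever testing on training data.
    static List<int[][]> kFoldSplits(int n, int k) {
        List<int[][]> splits = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) test.add(i); else train.add(i);
            }
            splits.add(new int[][] {
                train.stream().mapToInt(Integer::intValue).toArray(),
                test.stream().mapToInt(Integer::intValue).toArray()
            });
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 samples, 5 folds: each split trains on 8 and tests on 2
        for (int[][] split : kFoldSplits(10, 5)) {
            System.out.println("train=" + split[0].length
                    + " test=" + split[1].length);
        }
    }
}
```

OpenNLP ships cross validator tools per component that do this on real corpora and report precision/recall, which is what lets it benchmark itself on the user's own data.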

The main difference between OpenNLP and Stanford NER / NLTK is probably
that OpenNLP is developed for use in production systems rather than for
research or teaching purposes.

Jörn
