[D] Evaluating OpenNLP (from old wiki) [texera]

via GitHub Mon, 20 Oct 2025 22:14:49 -0700


GitHub user chenlica created a discussion: Evaluating OpenNLP (from old wiki)


>From the wiki page https://github.com/apache/texera/wiki/Evaluating-OpenNLP 
>(may be dangling)

=====

Author: Hailey Pan

Reviewed by: Chen Li

## Example Code

* `sandbox/src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample`
* https://github.com/Texera/texera/pull/308

## Introduction

The Apache OpenNLP library (https://opennlp.apache.org/) is a 
machine-learning-based toolkit for processing natural language text.

Both Stanford CoreNLP and OpenNLP require tokenization before doing any 
extraction. Stanford CoreNLP can extract all tags specified by a single 
annotator (Figure 1). On the other hand, OpenNLP has a trained model for each 
specific tag, i.e., it has a model for POS tags, a model for location NER tags, 
a model for person NER tags, and so on.

![Figure 1](https://docs.google.com/drawings/d/1C96GiXWWxDPlePFlHLVb6mnos82zvwTKiBk3I3oHXwQ/pub?w=157632&h=14592)
Figure 1: Stanford NLP Sample Code: annotation property setup

For named-entity recognition, Stanford CoreNLP works better on 
general-purpose text. OpenNLP might be a better choice when one wants to 
extract information from text using custom models trained on one's own corpus. 

## Models

Available models can be found at http://opennlp.sourceforge.net/models-1.5/. 
For instance, the "Part of Speech Tagger" marks each token with its word type, 
based on the token itself and its context. A token might have multiple possible 
POS tags, so the tagger uses a probability model to predict the correct one. To 
limit the number of possible tags for a token, a tag dictionary can be used, 
which improves both the tagging accuracy and the runtime performance of the 
tagger.
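
To illustrate how a tag dictionary narrows the tagger's choices, here is a minimal sketch in plain Java. It does not use the OpenNLP API: the class `TagDictionarySketch`, the method `allowedTags`, the tiny tag set, and the sample entries are all hypothetical and only show the idea. A token with a dictionary entry has only the listed tags as candidates, while any other token keeps the full tag set.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TagDictionarySketch {
    // Hypothetical, tiny tag set; a real POS model has many more tags.
    static final List<String> ALL_TAGS = List.of("NN", "VB", "JJ", "DT", "CC");

    // Hypothetical dictionary entries: lowercased token -> tags it may take.
    static final Map<String, List<String>> TAG_DICTIONARY = Map.of(
        "the", List.of("DT"),
        "run", List.of("NN", "VB"));

    // Returns the candidate tags a model would have to score for this token.
    // Tokens without a dictionary entry fall back to the full tag set.
    static List<String> allowedTags(String token) {
        return TAG_DICTIONARY.getOrDefault(token.toLowerCase(Locale.ROOT), ALL_TAGS);
    }

    public static void main(String[] args) {
        for (String token : List.of("The", "run", "texera")) {
            System.out.println(token + " -> " + allowedTags(token));
        }
    }
}
```

Fewer candidate tags per token means fewer outcomes to score and fewer ways to pick an implausible tag, which is why a dictionary can help both speed and accuracy.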

## Performance & Accuracy

Dataset: abstract_100.txt

Machine: 2016 MacBook Pro with 8GB RAM and 256GB SSD


**_Runtime (tokenizing runtime included)_**

| Task | OpenNLP | Stanford CoreNLP |
|:---:|:---:|:---:|
| POS | 11.65s | 2.69s |
| NER | 11.26s | 18.04s |

Stanford CoreNLP is about 4.3 times faster than OpenNLP at POS tagging (2.69s 
vs. 11.65s). OpenNLP is about 1.6 times faster than Stanford CoreNLP at NER 
tagging (11.26s vs. 18.04s). However, the NER comparison is not like-for-like: 
Stanford CoreNLP extracts all NER tags while OpenNLP extracts only location tags.

**POSTag Results**

Most irrelevant results (e.g., punctuation) are filtered out for both packages.

| OpenNLP | Stanford CoreNLP |
|:---:|:---:|
| 26,360 results | 25,919 results |
 
The results produced by the two tools are very similar. The difference of 441 
results is likely due to their tokenizers: the Stanford CoreNLP tokenizer 
handles punctuation better. For example, OpenNLP treats [.”] as a single token 
and tags it as a coordinating conjunction ("CC"), whereas Stanford CoreNLP does 
not tag it. This is the main reason OpenNLP produces more results than Stanford 
CoreNLP. Stanford CoreNLP may therefore be more accurate than OpenNLP because 
of its more accurate tokenization. 

More results are available in the following links:

* Stanford CoreNLP: [https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28](https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28)
* OpenNLP: [https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c](https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c)

**NER Results**

OpenNLP uses only a location NER model here, so we compare only the location 
NER results. OpenNLP returns each entity as a span of offsets, so “New York” is 
a single result, while Stanford CoreNLP tags each token and produces “New” and 
“York” as two results. For easy comparison, the OpenNLP results are split word 
by word. OpenNLP fails to recognize location abbreviations that contain 
punctuation, while Stanford CoreNLP recognizes them. For example, OpenNLP 
doesn’t tag “N.Y.” but Stanford CoreNLP does. Stanford CoreNLP can also 
recognize words based on non-English alphabets, while OpenNLP needs a separate 
model to do so. Overall, Stanford CoreNLP tends to be more accurate. 

| OpenNLP | Stanford CoreNLP |
|:---:|:---:|
| 150 results | 173 results |
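
The word-by-word separation of OpenNLP's results described above can be sketched as follows. This is a minimal illustration in plain Java, not the code used for the evaluation. The class `SpanSplitSketch`, the nested `TokenSpan` type, and the sample sentence are hypothetical. `TokenSpan` is assumed to hold token offsets with an inclusive start and an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSplitSketch {
    // Hypothetical stand-in for an entity span over a token array:
    // start is inclusive, end is exclusive.
    static final class TokenSpan {
        final int start;
        final int end;

        TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Expands each multi-token span into one result per token, so the output
    // can be compared with a tagger that labels tokens individually.
    static List<String> splitByWord(String[] tokens, List<TokenSpan> spans) {
        List<String> words = new ArrayList<>();
        for (TokenSpan span : spans) {
            for (int i = span.start; i < span.end; i++) {
                words.add(tokens[i]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = {"She", "moved", "to", "New", "York", "from", "Irvine", "."};
        List<TokenSpan> spans = List.of(new TokenSpan(3, 5), new TokenSpan(6, 7));
        System.out.println(splitByWord(tokens, spans));
    }
}
```

With this sample input, the two location spans expand to three per-word results (New, York, Irvine), which matches how a token-level tagger would count them.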

More results are available in the following links:

* Stanford CoreNLP: [https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0](https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0)
* OpenNLP: [https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs](https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs)


## Popularity of OpenNLP

![](https://docs.google.com/drawings/d/19lHT0woeFTPjTITdBVAYjYnk-Wn7KlEENBgthEKkEk8/pub?w=200640&h=47040)
![](https://docs.google.com/drawings/d/1aADOu-JU0FIwUnxG02NHlQpIxW6BQBtlH8rh_ge3oZA/pub?w=188352&h=65472)
![](https://docs.google.com/drawings/d/1naVVICKCVe2NhkNjykCqTsk1mDAsnFHtV8pxhZDAHhY/pub?w=176064&h=79104)

Source: https://java.libhunt.com/project/corenlp/vs/apache-opennlp




GitHub link: https://github.com/apache/texera/discussions/3965
