GitHub user chenlica created a discussion: Evaluating OpenNLP (from old wiki)
From the wiki page https://github.com/apache/texera/wiki/Evaluating-OpenNLP (may be dangling)

=====

Author: Hailey Pan

Reviewed by: Chen Li

##Example Code

* `sandbox/src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample`
* https://github.com/Texera/texera/pull/308

##Introduction

The Apache OpenNLP library (https://opennlp.apache.org/) is a machine-learning-based toolkit for processing natural language text.

Both Stanford CoreNLP and OpenNLP require tokenization before doing any extraction. Stanford CoreNLP can extract all tags specified by a single annotator (Figure 1). OpenNLP, on the other hand, has a separate trained model for each specific tag: a model for POS tags, a model for location NER tags, a model for person NER tags, and so on.

Figure 1: Stanford NLP Sample Code: annotation property setup

In terms of named-entity recognition, Stanford CoreNLP works better on general-purpose text. OpenNLP might be a better choice when one wants to extract information from text by using one's own models trained on a corpus.

##Models

Available models can be found at http://opennlp.sourceforge.net/models-1.5/.

For instance, the "Part of Speech Tagger" marks tokens with their corresponding word types based on the token itself and its context. A token might have multiple possible POS tags, so the tagger uses a probability model to predict the correct one. To limit the number of possible tags for a token, a tag dictionary can be used, which improves both the tagging and the runtime performance of the tagger.

##Performance & Accuracy

Dataset: abstract_100.txt

Machine: 2016 MacBook Pro with 8GB RAM and 256GB SSD

**_Runtime (tokenizing runtime included)_**

| | OpenNLP | Stanford CoreNLP |
|:---:|:---:|:---:|
| POS | 11.65s | 2.69s |
| NER | 11.26s | 18.04s |

Stanford CoreNLP is much faster than OpenNLP at POS tagging (2.69s versus 11.65s). OpenNLP is faster than Stanford CoreNLP at NER tagging (11.26s versus 18.04s). However, the NER timings are not directly comparable: Stanford CoreNLP extracts all NER tags, while OpenNLP extracts only location tags.

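The page does not say how the runtimes were collected. As a minimal sketch of one way to take such a measurement, the snippet below times a whole pipeline run with tokenization inside the timed region. The class `RuntimeSketch` and its `tokenize` and `tag` methods are hypothetical stand-ins so that the example runs without either library; they are not the OpenNLP or Stanford CoreNLP APIs.

```java
import java.util.Arrays;

final class RuntimeSketch {
    /** Runs the task once and returns the elapsed wall-clock time in seconds. */
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    /** Hypothetical stand-in for a real tokenizer: splits on whitespace. */
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Hypothetical stand-in for a real tagger: labels every token "X". */
    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, "X");
        return tags;
    }

    public static void main(String[] args) {
        String text = "OpenNLP and Stanford CoreNLP both tokenize before tagging .";
        // Tokenization sits inside the timed region, matching
        // "tokenizing runtime included" in the runtime table.
        double seconds = timeSeconds(() -> tag(tokenize(text)));
        System.out.printf("tokenize + tag: %.6f s%n", seconds);
    }
}
```

A single run like this includes JVM warm-up and, for real libraries, possibly model loading, so repeated runs would give steadier numbers.
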
**POSTag Results**

Most irrelevant results (e.g., punctuation) are eliminated for both packages.

| OpenNLP | Stanford CoreNLP |
|:---:|:---:|
| 26,360 results | 25,919 results |

The results produced by the two tools are very similar. The difference in the number of results is probably due to their tokenizers. The Stanford CoreNLP tokenizer handles punctuation better. For example, OpenNLP treats [.”] as a token and tags it as a coordinating conjunction ("CC"), whereas Stanford CoreNLP does not tag it. This is the main reason OpenNLP produces more results than Stanford CoreNLP. Stanford CoreNLP may therefore be more accurate than OpenNLP because of its more accurate tokenization.

More results are available at the following links:

* Stanford CoreNLP: [https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28](https://drive.google.com/open?id=0B-d4eVox97e3UXpUMUpfckxSb28)
* OpenNLP: [https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c](https://drive.google.com/open?id=0B-d4eVox97e3bnZyT2gzX1dJX3c)

**NER Results**

OpenNLP uses a location NER model only, so we compare only the location NER results. OpenNLP reports results as offsets, e.g., “New York” as a single result, while Stanford CoreNLP produces “New” and “York” separately. For easy comparison, the OpenNLP results are split word by word.

OpenNLP cannot recognize abbreviated location names that contain punctuation, while Stanford CoreNLP can. For example, OpenNLP does not tag “N.Y.”, but Stanford CoreNLP does. Also, Stanford CoreNLP can recognize non-English alphabet-based words, while OpenNLP needs another model to do so. Overall, Stanford CoreNLP tends to be more accurate.

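The word-by-word split described above can be sketched as follows. OpenNLP's name finder returns spans of token indices (start inclusive, end exclusive). The `TokenSpan` class here is a hypothetical stand-in for that span type so the example runs without the library, and `splitSpans` is an illustrative helper, not code taken from the evaluation itself.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanSplitSketch {
    /** Hypothetical token-index span: start inclusive, end exclusive. */
    static final class TokenSpan {
        final int start;
        final int end;

        TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Expands each multi-token span into one entry per covered token. */
    static List<String> splitSpans(String[] tokens, List<TokenSpan> spans) {
        List<String> words = new ArrayList<>();
        for (TokenSpan span : spans) {
            for (int i = span.start; i < span.end; i++) {
                words.add(tokens[i]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = {"She", "moved", "to", "New", "York", "from", "Boston", "."};
        // Two location spans: "New York" (tokens 3..4) and "Boston" (token 6).
        List<TokenSpan> spans = List.of(new TokenSpan(3, 5), new TokenSpan(6, 7));
        System.out.println(splitSpans(tokens, spans)); // prints [New, York, Boston]
    }
}
```

After this step, two span results become three per-word results, which is the form that can be counted against a per-token tagger's output.
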
| OpenNLP | Stanford CoreNLP |
|:---:|:---:|
| 150 results | 173 results |

More results are available at the following links:

* Stanford CoreNLP: [https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0](https://drive.google.com/open?id=0B-d4eVox97e3Ym5HRm9nbEdiSk0)
* OpenNLP: [https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs](https://drive.google.com/open?id=0B-d4eVox97e3OHNfRmNZcU92ZGs)

##Popularity of OpenNLP

Source: https://java.libhunt.com/project/corenlp/vs/apache-opennlp

GitHub link: https://github.com/apache/texera/discussions/3965
