Julien Subercaze created OPENNLP-830:
----------------------------------------
Summary: Huge runtime improvement on training (POS, Chunk, ...)
Key: OPENNLP-830
URL: https://issues.apache.org/jira/browse/OPENNLP-830
Project: OpenNLP
Issue Type: Improvement
Components: Machine Learning, POS Tagger
Affects Versions: 1.6.0
Environment: Any
Reporter: Julien Subercaze
opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that is used
to store index mappings. It is heavily used in opennlp.tools.ml.* (i.e. by
every model) and leads to very poor training performance.
This hash table is probably legacy code and is highly inefficient. A simple
drop-in replacement, a wrapper around java.util.HashMap, solves the issue,
doesn't break compatibility and does not add any dependency.
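
To illustrate the idea, here is a minimal sketch of such a wrapper. It is not
the code from the linked fix: the class name HashMapIndexTable and its exact
methods are hypothetical, and it assumes the table only needs to map each key
to its position in an array and to return -1 for unknown keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index table backed by java.util.HashMap.
// Assumption: callers only need key -> array position lookups.
final class HashMapIndexTable<T> {

    private final Map<T, Integer> indexOf;

    HashMapIndexTable(T[] keys) {
        // Pre-size the map so that no rehashing happens while filling it.
        indexOf = new HashMap<>(Math.max(16, (int) (keys.length / 0.75f) + 1));
        for (int i = 0; i < keys.length; i++) {
            if (indexOf.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    // Returns the index of the key, or -1 if the key is not in the table.
    int get(T key) {
        Integer index = indexOf.get(key);
        return index == null ? -1 : index;
    }

    // Returns the number of keys stored in the table.
    int size() {
        return indexOf.size();
    }
}
```

Lookups go through HashMap's own hashing and collision handling, so no
additional dependency is needed.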
When training a POS tagger on a large dataset with custom tags, I see a
fivefold speedup. It also seems to speed up the training pipeline of all ML
models.
For a quick fix, see:
https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java