[ 
https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103720#comment-15103720
 ] 

Joern Kottmann commented on OPENNLP-830:
----------------------------------------

This map used to be faster than java.util.HashMap. I believe the reason is 
that it needs less memory than java.util.HashMap and therefore used to fit 
better in the CPU cache.

I think the map we have here is slow because its load factor is rather 
aggressive. We could run a few tests, tune that parameter, and see where we 
stand then.
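
To make the load factor experiment concrete, here is a minimal sketch of the 
kind of test I mean. It is not the actual IndexHashTable code: the class 
ProbingIndexTable, its linear probing scheme, the synthetic "feature" keys and 
the chosen load factors are all assumptions made for illustration. It counts 
how many slots a lookup inspects on average for a given load factor.

```java
/** Hypothetical linear-probing table that maps each key to its array index. */
final class ProbingIndexTable {
    private final Object[] keys;
    private final int[] values;
    private long probes;  // slots inspected by get() so far
    private long lookups; // number of get() calls so far

    ProbingIndexTable(Object[] mapping, double loadFactor) {
        if (loadFactor <= 0 || loadFactor > 1) {
            throw new IllegalArgumentException("load factor must be in (0, 1]");
        }
        // A higher load factor means a smaller table: less memory, more collisions.
        // The +1 guarantees at least one empty slot, so probing always terminates.
        int capacity = (int) (mapping.length / loadFactor) + 1;
        keys = new Object[capacity];
        values = new int[capacity];
        for (int i = 0; i < mapping.length; i++) {
            int slot = indexFor(mapping[i]);
            while (keys[slot] != null) {
                slot = (slot + 1) % capacity; // linear probing
            }
            keys[slot] = mapping[i];
            values[slot] = i;
        }
    }

    private int indexFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % keys.length;
    }

    /** Returns the index stored for key, or -1 if the key is absent. */
    int get(Object key) {
        lookups++;
        int slot = indexFor(key);
        while (true) {
            probes++;
            if (keys[slot] == null) {
                return -1;
            }
            if (keys[slot].equals(key)) {
                return values[slot];
            }
            slot = (slot + 1) % keys.length;
        }
    }

    double averageProbes() {
        return lookups == 0 ? 0.0 : (double) probes / lookups;
    }

    public static void main(String[] args) {
        // Synthetic keys; real predicate names would have a different distribution.
        String[] mapping = new String[50_000];
        for (int i = 0; i < mapping.length; i++) {
            mapping[i] = "feature" + i;
        }
        for (double loadFactor : new double[] {0.5, 0.7, 0.9}) {
            ProbingIndexTable table = new ProbingIndexTable(mapping, loadFactor);
            for (String key : mapping) {
                table.get(key);
            }
            System.out.printf("load factor %.1f: %.2f probes per lookup%n",
                    loadFactor, table.averageProbes());
        }
    }
}
```

Probe counts only approximate the real cost; wall-clock timings of an actual 
training run would still be needed before changing the parameter.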

Otherwise we could just try to get an even faster implementation from one of 
the license-compatible libraries with high-performance primitive collections.

> Huge runtime improvement on training (POS, Chunk, ...)
> ------------------------------------------------------
>
>                 Key: OPENNLP-830
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-830
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning, POS Tagger
>    Affects Versions: 1.6.0
>         Environment: Any
>            Reporter: Julien Subercaze
>              Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that is 
> used to store the index mapping. It is heavily used in opennlp.tools.ml.* 
> (i.e. by every model) and leads to disastrous performance.
> This hash table is probably legacy code and is highly inefficient. A 
> simple drop-in replacement by a java.util.HashMap wrapper solves the issue, 
> doesn't break compatibility and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a fivefold 
> improvement. It also seems to improve the training pipeline of all ML models.
> For a quick fix, see: 
> https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
