[jira] [Commented] (OPENNLP-830) Huge runtime improvement on training (POS, Chunk, ...)

Tommaso Teofili (JIRA) Sat, 16 Jan 2016 23:01:26 -0800

    [ 
https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103623#comment-15103623
 ]


Tommaso Teofili commented on OPENNLP-830:
-----------------------------------------

[[email protected]] while I agree that this looks like a bit of legacy 
code, it'd be good to have clear evidence of such improvements. Are you able 
to produce a test case and/or provide a link to code where we can see these 
performance gains?

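As a rough illustration of the kind of test case meant here, the sketch below 
times repeated key-to-index lookups through any lookup function, so the same 
loop can be run against the existing table and against a replacement. The 
class name LookupTimer and its method are hypothetical and not part of 
OpenNLP; a trustworthy measurement would also need warm-up runs or a harness 
such as JMH.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of OpenNLP: times key -> index lookups.
final class LookupTimer {

    // Runs `rounds` passes over `keys` and returns {elapsedNanos, checksum}.
    // The checksum is the sum of all returned indices; returning it keeps the
    // JIT from discarding the lookups and lets two implementations be checked
    // for identical results.
    static long[] timeLookups(String[] keys, ToIntFunction<String> lookup, int rounds) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String key : keys) {
                checksum += lookup.applyAsInt(key);
            }
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {elapsed, checksum};
    }
}
```

Calling timeLookups once with a lambda that wraps the existing table's lookup 
and once with one that wraps the replacement, on the same keys, gives two 
timings to compare; equal checksums confirm both return the same indices.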

> Huge runtime improvement on training (POS, Chunk, ...)
> ------------------------------------------------------
>
>                 Key: OPENNLP-830
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-830
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning, POS Tagger
>    Affects Versions: 1.6.0
>         Environment: Any
>            Reporter: Julien Subercaze
>              Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that is 
> used to store index mappings. This hash table is heavily used in 
> opennlp.tools.ml.* (i.e. by every model) and leads to disastrous performance.
> It is probably legacy code and is highly inefficient. A simple drop-in 
> replacement by a java.util.HashMap wrapper solves the issue, doesn't break 
> compatibility, and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a factor of 
> 5 improvement. It also seems to improve the training pipeline of all ML 
> models.
> For a quick fix, see: 
> https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java

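For readers who do not follow the link, a java.util.HashMap wrapper of the 
kind described above might look roughly like the sketch below. This is not 
the code from the linked fork: the class name MapBackedIndex is hypothetical, 
and the constructor, get and size signatures, as well as the -1 return value 
for unknown keys, are assumptions about what such a table would expose.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a key -> index table backed by java.util.HashMap.
// Names and signatures are assumptions, not the actual OpenNLP class.
final class MapBackedIndex<T> {

    private final Map<T, Integer> indexByKey;

    // Each key is mapped to its position in the given array.
    MapBackedIndex(T[] keys) {
        // Size the map up front so it does not rehash while being filled.
        int capacity = Math.max(16, (int) (keys.length / 0.75f) + 1);
        indexByKey = new HashMap<>(capacity);
        for (int i = 0; i < keys.length; i++) {
            if (indexByKey.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    // Returns the index of the key, or -1 if the key is unknown (assumed convention).
    int get(T key) {
        Integer index = indexByKey.get(key);
        return index == null ? -1 : index;
    }

    // Number of distinct keys stored.
    int size() {
        return indexByKey.size();
    }
}
```

With keys {"NN", "VB", "DT"}, get("VB") returns 1 and get("JJ") returns -1. 
The lookups box indices as Integer objects, which costs some memory compared 
with a primitive int array but needs no extra dependency.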


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
