[ 
https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103995#comment-15103995
 ] 

Julien Subercaze edited comment on OPENNLP-830 at 1/17/16 11:53 PM:
--------------------------------------------------------------------

Hi there,

first of all, my bad: the performance gain of this fix is not 5-fold as I wrote 
yesterday; I measured a 1.65x improvement. I thought I had deactivated all the 
other optimizations, but my code was such a mess ... Anyway, I still have good 
news regarding performance improvement.

To answer your question, I created a test project to measure performance impact 
on both training and tagging. The project is here:
https://github.com/jsubercaze/opennlp-harness
and the README contains links to the required files.

I cannot release the data I'm working on, and a major problem was finding a free 
large dataset to train the model. I didn't find any, so I concatenated several 
ebooks from Project Gutenberg and created a 'silver' training set using the 
en-maxent model.

Not so sure this is the right place, but I introduced several other 
optimizations that I imported/cleaned from my mess. The project is to be found 
here (numbered 1.6.2-SNAPSHOT for testing purposes):
https://github.com/jsubercaze/opennlp-tools/commits/
Each optimization is a single commit, so you can step back through them:
https://github.com/jsubercaze/opennlp-tools/commits/trunk
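For context, the Hashtable fix measured below is essentially a drop-in, 
java.util.HashMap-backed replacement for the custom IndexHashTable. A minimal 
sketch of the idea (the class and method names here are illustrative, loosely 
mirroring the original key-to-index API, not the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: back the key->index lookup with java.util.HashMap instead of a
// custom open-addressing table. Names loosely mirror IndexHashTable's API.
class HashMapIndexTable<T> {
    private final Map<T, Integer> map;

    HashMapIndexTable(T[] mapping) {
        map = new HashMap<T, Integer>(mapping.length * 2);
        for (int i = 0; i < mapping.length; i++) {
            map.put(mapping[i], i);
        }
    }

    /** Returns the index of the key, or -1 if the key is not in the table. */
    int get(T key) {
        Integer idx = map.get(key);
        return idx != null ? idx.intValue() : -1;
    }

    int size() {
        return map.size();
    }
}
```

Because the wrapper keeps the same lookup semantics (index or -1), callers in 
opennlp.tools.ml.* don't need to change.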

So here come the numbers (i5 2500K, -Xms3G -Xmx5G):

* Building the dataset (tagging) on the Gutenberg small dataset

1.6.0:
  - Exec time:   454502 ms
  - Throughput:  3295.87 sentences/sec

Hashtable fix (tests pass):
  - Exec time:   273929 ms
  - Throughput:  5468.50 sentences/sec

Fast exponential (from commons-math3) in the eval method (tests pass):
  - Exec time:   214200 ms
  - Throughput:  6993.36 sentences/sec
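The fast-exponential change swaps Math.exp for 
org.apache.commons.math3.util.FastMath.exp in the eval method; both take a 
single double, so it is a drop-in swap. A stdlib-only sketch of the kind of 
exp-heavy hotspot involved (illustrative code, not the actual OpenNLP eval 
method):

```java
// Sketch of the exp-heavy hotspot in maxent evaluation: turning raw outcome
// scores into normalized probabilities. In the actual change, Math.exp below
// is replaced by org.apache.commons.math3.util.FastMath.exp (same signature).
class OutcomeNormalizer {
    static double[] normalize(double[] scores) {
        double[] probs = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            probs[i] = Math.exp(scores[i]); // hotspot: called per outcome, per event
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }
}
```

Since exp is called once per outcome for every evaluated event, even a modest 
speedup of the call compounds over a large corpus.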
  
  
* Training a POS tagger on the Gutenberg small dataset

  - 1.6.0                    :  356867 ms
  - HashTable fix            :  326677 ms
  - Multithreading log model :  209315 ms
  - Writer thread for update :  201648 ms
    

Multithreading log model: Maxent already offers multithreading (the code needs 
some cleanup), but the thread-count parameter is not exposed and its default 
value is 1. I changed it to use all available cores.
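A generic, stdlib-only sketch of the idea behind that change: split the 
per-event work of one training iteration across all available cores and merge 
the partial results, instead of running on a single thread (illustrative code, 
not the actual Maxent trainer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: fan the per-event work of one iteration out over all cores,
// then merge the partial results. The summation stands in for the real
// per-event model update.
class ParallelIteration {
    static double sumOverEvents(final double[] events) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (events.length + threads - 1) / threads;
        List<Future<Double>> parts = new ArrayList<Future<Double>>();
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(events.length, t * chunk);
            final int to = Math.min(events.length, from + chunk);
            parts.add(pool.submit(new Callable<Double>() {
                public Double call() {
                    double s = 0.0;
                    for (int i = from; i < to; i++) {
                        s += events[i]; // stand-in for the per-event update
                    }
                    return s;
                }
            }));
        }
        double total = 0.0;
        for (Future<Double> f : parts) {
            total += f.get(); // merge partial results
        }
        pool.shutdown();
        return total;
    }
}
```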

Writer thread for update: in the TwoPassDataIndexer, blocking I/O slows down 
the process, so I moved the writing to a separate thread (in a Java 5-compatible 
way; there are far simpler options in Java >= 7).
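The writer-thread idea, sketched in a Java 5-compatible way (plain Thread plus 
java.util.concurrent.BlockingQueue; illustrative code, not the actual 
TwoPassDataIndexer patch): the producer hands lines to a bounded queue and keeps 
computing, while a dedicated thread drains the queue and does the blocking 
writes.

```java
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: decouple computation from blocking I/O. The producer enqueues
// lines; a dedicated worker thread drains the queue and writes them out.
class AsyncLineWriter {
    private static final String POISON = "\u0000EOF"; // end-of-stream marker
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(1024);
    private final Thread worker;

    AsyncLineWriter(final Writer out) {
        worker = new Thread(new Runnable() {
            public void run() {
                try {
                    String line;
                    while (!(line = queue.take()).equals(POISON)) {
                        out.write(line);
                        out.write('\n');
                    }
                    out.flush();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
        worker.start();
    }

    /** Enqueues a line; blocks only when the queue is full. */
    void writeLine(String line) throws InterruptedException {
        queue.put(line);
    }

    /** Signals end of input and waits for the worker to finish writing. */
    void close() throws InterruptedException {
        queue.put(POISON);
        worker.join();
    }
}
```

The bounded queue provides back-pressure: if the disk cannot keep up, the 
producer blocks instead of exhausting the heap.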



 


> Huge runtime improvement on training (POS, Chunk, ...)
> ------------------------------------------------------
>
>                 Key: OPENNLP-830
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-830
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning, POS Tagger
>    Affects Versions: 1.6.0
>         Environment: Any
>            Reporter: Julien Subercaze
>              Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made hashtable that is used 
> to store a mapping index. This hashtable is heavily used in opennlp.tools.ml.* 
> (i.e. every model) and leads to disastrous performance.
> This hashtable is probably legacy code and is highly inefficient. A simple 
> drop-in replacement by a java.util.HashMap wrapper solves the issue, doesn't 
> break compatibility, and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a factor-5 
> improvement. It also seems to speed up the training pipeline of all ML models.
> See 
> https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java
> for a quick fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
