GitHub user KellenSunderland opened a pull request:

    https://github.com/apache/incubator-joshua/pull/48

    Fixed crashing when using Trie based KenLM models with State Minimization
    
    There are two separate issues that cause very rare crashes when using
    KenLM with Joshua.  The only configuration under which we've seen crashes is
    one that uses more than one Trie-based state-minimizing model.  One of these
    issues is addressed in this patch; the other will have to be contributed to
    the KenLM repo.
    
    This first patch deals with hash collisions in the mapping between a given
    KenLM state and the stored memory address of that state in KenLM’s memory
    pool.  Previously these collisions were ignored.  The problem with ignoring
    collisions here is that the two models can have different vocabulary sizes.
    This means that if you collide and take the state of a model with a larger
    vocabulary, you could end up with a unigram word id that is out of range.
    Looking up that word id will cause a crash.
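    
    As an illustration only (this is not the code from the patch), the sketch
    below shows one way a state-to-address mapping can handle collisions instead
    of ignoring them: each hash keeps every entry that produced it, and a lookup
    compares the full state before reusing an address.  The class name, the
    deliberately weak hash, and the sequential addresses are all assumptions made
    for the example; none of them come from Joshua or KenLM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; no name here is taken from Joshua or KenLM. */
final class CollisionAwareStatePool {

    /** One pooled state: its word ids and its (made-up) pool address. */
    private static final class Entry {
        final int[] wordIds;
        final long address;

        Entry(int[] wordIds, long address) {
            this.wordIds = wordIds.clone();
            this.address = address;
        }
    }

    // Hash -> every entry that produced that hash, so colliding states coexist.
    private final Map<Long, List<Entry>> buckets = new HashMap<>();
    private long nextAddress = 0;

    /** Deliberately weak stand-in hash so that collisions are easy to trigger. */
    static long weakHash(int[] wordIds) {
        long sum = 0;
        for (int id : wordIds) {
            sum += id;
        }
        return sum;
    }

    /** Returns the address for this state, allocating one if it is not pooled yet. */
    long addressOf(int[] wordIds) {
        List<Entry> bucket =
            buckets.computeIfAbsent(weakHash(wordIds), h -> new ArrayList<>());
        for (Entry e : bucket) {
            // A matching hash is not enough: compare the full contents.
            if (Arrays.equals(e.wordIds, wordIds)) {
                return e.address;
            }
        }
        Entry created = new Entry(wordIds, nextAddress++);
        bucket.add(created);
        return created.address;
    }

    public static void main(String[] args) {
        CollisionAwareStatePool pool = new CollisionAwareStatePool();
        int[] a = {1, 4}; // weakHash = 5
        int[] b = {2, 3}; // weakHash = 5 as well, but a different state
        System.out.println(pool.addressOf(a) != pool.addressOf(b));
        System.out.println(pool.addressOf(a) == pool.addressOf(new int[] {1, 4}));
    }
}
```

    In this sketch the two states share a hash but receive distinct addresses,
    while asking for the same state twice returns the same address.  A map that
    ignored the collision would instead hand back the first state's address for
    both.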
    
    The second issue (to be patched in KenLM) is that the equality operator in
    state.hh isn't comparing enough values to properly differentiate between
    states during a collision.
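    
    To make the general failure mode concrete, here is a small sketch of an
    equality check that compares too few values.  It is written in Java for
    consistency with Joshua and is not KenLM's state.hh; the class, its fields,
    and the choice of which value the weak check skips are all hypothetical,
    since this description does not say which values the real operator omits.

```java
/** Hypothetical state holder; the fields are illustrative, not KenLM's. */
final class SketchState {
    final int[] words; // context word ids
    final int length;  // number of valid entries in words

    SketchState(int[] words, int length) {
        this.words = words.clone();
        this.length = length;
    }

    /** Too weak: any two states of equal length compare equal. */
    boolean weakEquals(SketchState other) {
        return length == other.length;
    }

    /** Stricter: also compares every valid word id. */
    boolean strictEquals(SketchState other) {
        if (length != other.length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (words[i] != other.words[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SketchState a = new SketchState(new int[] {7, 9}, 2);
        SketchState b = new SketchState(new int[] {7, 12}, 2);
        System.out.println(a.weakEquals(b));   // the two states are conflated
        System.out.println(a.strictEquals(b)); // the two states are told apart
    }
}
```

    With the weak check, two different states that happen to collide are treated
    as the same state; the stricter check distinguishes them.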
    
    -----
    
    I did some performance regression testing after making this change.
    Running with a few affected models, I did 5 rounds of 30K translations each.
    In my testing, the performance of Joshua was not impacted.
    
    -----
    
    A note about applying the two fixes: they can be applied independently.
    Applying this patch to Joshua first will mitigate the issue, which will then
    happen an order of magnitude less often.  Once the second patch has been
    applied in KenLM, we can simply merge it at any point to fully fix the issue.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KellenSunderland/incubator-joshua kenlm_pool_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-joshua/pull/48.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #48
    
----
commit 8e7e7776b8b0fa862f08750b1dddd93f9e5f3487
Author: Kellen Sunderland <[email protected]>
Date:   2016-08-30T08:52:07Z

    Fixed crashing when using Trie based KenLM models with State Minimization
    

----


