Tommaso Teofili created JOSHUA-338:
--------------------------------------

             Summary: Generate smaller models for LPs
                 Key: JOSHUA-338
                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
             Project: Joshua
          Issue Type: Task
          Components: core
            Reporter: Tommaso Teofili


Phrase tables and grammars can get very big when trained on lots of parallel 
data, which makes it hard to distribute them in Language Packs. A quick way to 
reduce model size is to reduce the amount of parallel data used to build 
models by sampling a subset of it. This is the very naive approach used in 
the construction of the original language packs (November 2016), but there are 
much better ways. One relatively simple one is the Vocabulary Saturation Filter 
(VSF), proposed by Will Lewis and Sauleh Eetemadi and described in paper [1]. 
It would be wonderful to implement this and use it to do a better job selecting 
which sentences to include for our general-purpose language packs.

It would be ideal to implement this in Java, but Python or Scala would also fit 
well inside Joshua.
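
As a starting point for discussion, here is a rough Python sketch of the 
vocabulary-saturation idea: walk the corpus in order and keep a sentence pair 
only while it still contains a source-side n-gram seen fewer than `threshold` 
times in the pairs kept so far. The names `vsf_select`, `threshold`, and 
`max_n` are illustrative assumptions, as are whitespace tokenization and 
counting each n-gram once per sentence. This is not the paper's exact 
procedure and not existing Joshua code; see [1] for the details.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vsf_select(pairs, threshold=1, max_n=1):
    """Keep (source, target) pairs that still add under-represented n-grams.

    Assumptions for this sketch: whitespace tokenization, source side only,
    and each n-gram counted at most once per kept sentence.
    """
    counts = Counter()
    kept = []
    for src, tgt in pairs:
        grams = set(ngrams(src.split(), max_n))
        # Keep the pair if any of its n-grams is not yet "saturated".
        if any(counts[g] < threshold for g in grams):
            kept.append((src, tgt))
            # Only kept sentences contribute to the saturation counts.
            counts.update(grams)
    return kept
```

The result depends on corpus order, so any ordering or pre-sorting step 
would need to be decided separately.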

[1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
