[ https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236608#comment-17236608 ]
Tommaso Teofili commented on JOSHUA-338:
----------------------------------------

It'd be great to have you pursue this for GSoC. Feel free to ask any questions and/or write to the Joshua mailing list (dev@joshua.apache.org).

> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel
> data, which makes it hard to distribute them in Language Packs. A quick way
> to reduce model size is to reduce the amount of parallel data used to build
> models by sampling a subset of it. This is the very naive approach used in
> the construction of the original language packs (November 2016), but there
> are much better ways. One relatively simple one is the Vocabulary Saturation
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in
> [1]. It would be wonderful to implement this and use it to do a better
> job of selecting which sentences to include in our general-purpose language
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
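
To make the vocabulary-saturation idea more concrete, here is a minimal sketch in Python. It is a simplified reading of the idea, not the method exactly as specified in [1] and not any existing Joshua code: it assumes whitespace tokenization, looks at source-side unigrams only, and uses a hypothetical function name (`vocabulary_saturation_filter`) and a hypothetical `threshold` parameter. A sentence pair is kept only while it still contains at least one source word that has appeared in fewer than `threshold` kept pairs.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]],
    threshold: int = 1,
) -> List[Tuple[str, str]]:
    """Simplified, unigram-only saturation filter (illustrative assumption).

    Scans the corpus once, in order, and keeps a (source, target) pair only
    if some source word has been seen in fewer than `threshold` kept pairs.
    """
    counts: Counter = Counter()          # word -> number of kept pairs containing it
    kept: List[Tuple[str, str]] = []
    for source, target in pairs:
        words = set(source.split())      # assumption: whitespace tokenization
        # The pair is still useful if any of its words is not yet saturated.
        if any(counts[w] < threshold for w in words):
            kept.append((source, target))
            counts.update(words)         # each distinct word counted once per kept pair
    return kept


if __name__ == "__main__":
    corpus = [
        ("a b", "x y"),
        ("a b", "x y"),
        ("a c", "x z"),
        ("b c", "y z"),
    ]
    for pair in vocabulary_saturation_filter(corpus, threshold=1):
        print(pair)
```

Because the filter is a single sequential pass, its result depends on corpus order, and raising `threshold` keeps more data. A fuller version would presumably handle higher-order n-grams and the target side as well; those details should be checked against [1] rather than taken from this sketch.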