[ 
https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236608#comment-17236608
 ] 

Tommaso Teofili commented on JOSHUA-338:
----------------------------------------

It'd be great to have you pursue this for GSoC. Feel free to ask any 
questions here or write to the Joshua mailing list (dev@joshua.apache.org).

> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel 
> data, which makes it hard to distribute them in Language Packs. A quick way 
> to reduce model size is to reduce the amount of parallel data used to build 
> models by sampling a subset of it. This is the very naive approach used in 
> the construction of the original language packs (November 2016), but there 
> are much better ways. One relatively simple one is the Vocabulary Saturation 
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in 
> paper [1]. It would be wonderful to implement this and use it to do a better 
> job selecting which sentences to include for our general-purpose language 
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also 
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
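
For anyone picking this up, here is a rough sketch of the idea behind VSF as
I read it: make a single pass over the parallel data, count n-grams, and keep
a sentence pair only while at least one of its n-grams has been seen fewer
than some threshold number of times. Everything named here
(vocabulary_saturation_filter, threshold, max_n) is hypothetical and not part
of Joshua, and filtering on the source side only is a simplifying assumption;
check the details against paper [1] before relying on it.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram (as a tuple) of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vocabulary_saturation_filter(pairs, threshold=1, max_n=1):
    """Keep a (source, target) pair only if some source n-gram is unsaturated.

    Assumption: counts are tracked on the whitespace-tokenized source side
    only, and are updated only for pairs that are kept.
    """
    counts = Counter()
    kept = []
    for source, target in pairs:
        grams = set(ngrams(source.split(), max_n))
        # An n-gram is "saturated" once it has been seen `threshold` times.
        if any(counts[g] < threshold for g in grams):
            kept.append((source, target))
            counts.update(grams)
    return kept


if __name__ == "__main__":
    corpus = [
        ("a b", "x y"),
        ("a b", "x y"),
        ("a c", "x z"),
        ("b c", "y z"),
    ]
    for pair in vocabulary_saturation_filter(corpus, threshold=1, max_n=1):
        print(pair)
```

Because the filter is order-dependent, the result changes with how the corpus
is sorted, so any preferred ordering (for example, higher-quality data first)
would need to be applied before the pass.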



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
