[ https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kishani Kandasamy updated JOSHUA-338:
-------------------------------------
    Comment: was deleted

(was: Hi Tommaso Teofili,

Thank you for your reply. I'm particularly interested in completing this issue
as my GSoC 2021 project. Currently, I'm reading about the language models used
within Joshua in order to understand the project scope thoroughly. Thank you.

On Fri, Nov 20, 2020 at 11:19 PM Tommaso Teofili (Jira) <j...@apache.org>
)

> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel
> data, which makes it hard to distribute them in Language Packs. A quick way
> to reduce model size is to reduce the amount of parallel data used to build
> models, by sampling a subset of it. This is the very naive approach used in
> the construction of the original language packs (November 2016), but there
> are much better ways. One relatively simple one is the Vocabulary Saturation
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in
> paper [1]. It would be wonderful to implement this and use it to do a better
> job of selecting which sentences to include in our general-purpose language
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
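
For orientation, the general idea behind a vocabulary saturation filter can be
sketched in a few lines. This is a simplified reading of the technique, not the
exact procedure from paper [1] and not existing Joshua code. The function name
`vocabulary_saturation_filter` and the parameters `threshold` and `max_order`
are hypothetical. The sketch also assumes whitespace tokenization, looks only
at source-side n-grams, and counts each n-gram at most once per sentence. It
makes one pass over the parallel data and keeps a sentence pair only while at
least one of its n-grams has been seen fewer than `threshold` times in the
pairs kept so far. Python is used here for brevity, since the issue lists it
as an acceptable option.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], max_order: int) -> Iterator[Tuple[str, ...]]:
    """Yield every n-gram of order 1..max_order found in the token list."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]],
    threshold: int = 1,   # hypothetical name: saturation count per n-gram
    max_order: int = 1,   # hypothetical name: longest n-gram considered
) -> List[Tuple[str, str]]:
    """Keep a (source, target) pair only if some source n-gram is unsaturated.

    Assumption: an n-gram is "saturated" once it has appeared in `threshold`
    previously kept sentences. Pairs made up only of saturated n-grams are
    dropped, and dropped pairs do not contribute to the counts.
    """
    counts: Counter = Counter()
    kept: List[Tuple[str, str]] = []
    for source, target in pairs:
        # Assumption: plain whitespace tokenization of the source side.
        grams = set(ngrams(source.split(), max_order))
        if any(counts[g] < threshold for g in grams):
            kept.append((source, target))
            counts.update(grams)
    return kept
```

Because the pass is sequential, the result depends on the order of the input:
pairs that appear early are more likely to be kept. Raising `threshold` or
`max_order` keeps more data, so both are knobs for trading model size against
coverage.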