[ 
https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236336#comment-17236336
 ] 

Tommaso Teofili commented on JOSHUA-338:
----------------------------------------

Hi [~Kishani Kandasamy], you can have a look at how language models are 
currently used within Joshua in the related section of the docs: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630#TheJoshuaPipeline(6.1)-Languagemodel
Joshua currently supports the KenLM and BerkeleyLM implementations; any 
improvements could go into those implementations or into a new, more compact 
language model implementation. 
One idea we discussed a while ago was to try out OpenNLP's language modeling 
capabilities because of licensing issues with other such libraries.
That said, the best language models nowadays come from BERT models & co. 
(e.g. https://arxiv.org/abs/1909.11687), so there is a bit of research to do 
here on the best tradeoff between accuracy, computation requirements, 
speed, storage size, etc.


> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel 
> data, which makes it hard to distribute them in Language Packs. A quick way 
> to reduce model size is to reduce the amount of parallel data used to build 
> models, by sampling a subset of it. This is the very naive approach used in 
> the construction of the original language packs (November 2016), but there 
> are much better ways. One relatively simple one is the Vocabulary Saturation 
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in 
> paper [1]. It would be wonderful to implement this and use it to do a better 
> job selecting which sentences to include for our general-purpose language 
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also 
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
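
To make the filtering idea in the description above concrete, here is a minimal Python sketch of vocabulary saturation: walk the corpus in order and keep a sentence pair only while it still contains at least one n-gram seen fewer than a threshold number of times. This is a simplified illustration and not the reference implementation from [1]; the function names, the whitespace tokenization, the source-side-only counting, and the default threshold and n-gram order are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], max_order: int) -> Iterable[Tuple[str, ...]]:
    """Yield every n-gram of order 1..max_order from a token list."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]],
    threshold: int = 1,
    max_order: int = 1,
) -> List[Tuple[str, str]]:
    """Keep a (source, target) pair only if its source side contains at
    least one n-gram that has been counted fewer than `threshold` times.

    Assumptions of this sketch (hypothetical, not taken from the paper):
    whitespace tokenization, counts on the source side only, and counts
    updated only for pairs that are kept.
    """
    counts: Counter = Counter()
    kept: List[Tuple[str, str]] = []
    for source, target in pairs:
        grams = set(ngrams(source.split(), max_order))
        # An empty source sentence has no n-grams and is therefore dropped.
        if any(counts[g] < threshold for g in grams):
            kept.append((source, target))
            counts.update(grams)
    return kept


if __name__ == "__main__":
    corpus = [
        ("a b", "x y"),
        ("a b", "x y"),
        ("a c", "x z"),
        ("b a", "y x"),
    ]
    for pair in vocabulary_saturation_filter(corpus, threshold=1):
        print(pair)
```

Because the filter is order-dependent, the order of the input corpus matters: sentences seen earlier are more likely to be kept. Raising `threshold` or `max_order` keeps more data, so both act as knobs for trading model size against coverage.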



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
