Hi Lewis, Joshua supports two language model representation packages: KenLM [0] and BerkeleyLM [1]. These were developed at about the same time, and both represented large efficiency gains over what had previously been the standard toolkit (SRILM). Ken Heafield (who has contributed a lot to Joshua) went on to make many other improvements to language model representation, decoder integration, and also the actual construction of language models and their efficient interpolation. His goal for a while was to make SRILM completely unnecessary, and I think he succeeded.
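For context, the core operation these packages make fast is the backoff n-gram probability query over an ARPA-style table: use the longest matching n-gram, otherwise back off to a shorter history and add the history's backoff weight. Here's a minimal sketch in Python with toy log10 values (illustrative only; KenLM's actual data structures are compact tries and hash tables in C++, not dictionaries):

```python
# Toy ARPA-style tables: n-gram -> (log10 probability, log10 backoff weight).
# The numbers are made up for illustration, not from a real model.
unigrams = {
    "<s>": (-1.0, -0.4),
    "the": (-0.7, -0.3),
    "cat": (-1.2, -0.2),
}
bigrams = {
    ("<s>", "the"): (-0.3, -0.1),
    ("the", "cat"): (-0.5, 0.0),
}

def logprob(history, word):
    """Backoff query: return the bigram probability if it exists,
    otherwise back off to the unigram, charging the history's
    backoff weight."""
    if history and (history[-1], word) in bigrams:
        return bigrams[(history[-1], word)][0]
    # Unseen history gets a backoff weight of 0 (log10 of 1).
    backoff = unigrams.get(history[-1], (0.0, 0.0))[1] if history else 0.0
    return backoff + unigrams[word][0]
```

So a seen bigram like ("<s>", "the") is answered directly, while an unseen one like ("cat", "the") costs the backoff weight of "cat" plus the unigram probability of "the". The representation question both papers address is how to store and look up those tables for billions of n-grams without blowing up memory.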
BerkeleyLM was more of a one-off project. It is slower than KenLM and hasn't been touched in years. If you want to understand these techniques, your effort is probably best spent on the KenLM papers. It's also worth noting that Ken is a crack C++ programmer who has spent years hacking away at these problems, so your chances of finding further efficiencies there are probably quite limited unless you have a lot of background in the area. Even if you did, I would recommend you not spend your time that way: I basically consider the LM representation problem to have been solved by KenLM. That's not to say there aren't some improvements to be had on the Joshua / JNI bridge, but even there, there are probably better things to do.

matt

[0] KenLM: Faster and Smaller Language Model Queries
http://www.kheafield.com/professional/avenue/kenlm.pdf

[1] Faster and Smaller N-Gram Language Models
http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf

> On Oct 24, 2016, at 10:21 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
>
> Hi Folks,
> I have set out with the aim of learning more about the underlying Joshua
> language model serialization(s), e.g. statistical n-gram models in ARPA
> format [0], as well as trying to JProfile a running Joshua server to better
> understand how objects are used and what runtime memory usage looks like
> for typical translation tasks.
> This has led me to think about the fundamental performance issues we
> experience when loading large LMs into memory in the first place, and
> the efficiency of searching models regardless of whether they are cached in
> memory (e.g. Joshua server) or not.
> Does anyone have detailed technical/journal documentation which would set
> me in the right direction to address the above area?
> Thanks
> Lewis
>
> [0]
> http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats#statistical_n-gram_models_in_the_arpa_format
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney