Hi Lewis, Joshua supports two language model representation packages: KenLM [0] and BerkeleyLM [1]. These were developed at about the same time, and both represented large efficiency gains over what had previously been the standard toolkit (SRILM). Ken Heafield (who has contributed a lot to Joshua) went on to make many other improvements to language model representation, decoder integration, and also the actual construction of language models and their efficient interpolation. His goal for a while was to make SRILM completely unnecessary, and I think he succeeded.
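For context, the core operation these packages make fast is the backoff n-gram probability query over an ARPA-style table: use the longest matching n-gram, otherwise back off to a shorter history and add the history's backoff weight. Here's a minimal sketch in Python with toy log10 values (illustrative only; KenLM's actual data structures are compact tries and hash tables in C++, not dictionaries):

```python
# Toy ARPA-style tables: n-gram -> (log10 probability, log10 backoff weight).
# The numbers are made up for illustration, not from a real model.
unigrams = {
    "<s>": (-1.0, -0.4),
    "the": (-0.7, -0.3),
    "cat": (-1.2, -0.2),
}
bigrams = {
    ("<s>", "the"): (-0.3, -0.1),
    ("the", "cat"): (-0.5, 0.0),
}

def logprob(history, word):
    """Backoff query: return the bigram probability if it exists,
    otherwise back off to the unigram, charging the history's
    backoff weight."""
    if history and (history[-1], word) in bigrams:
        return bigrams[(history[-1], word)][0]
    # Unseen history gets a backoff weight of 0 (log10 of 1).
    backoff = unigrams.get(history[-1], (0.0, 0.0))[1] if history else 0.0
    return backoff + unigrams[word][0]
```

So a seen bigram like ("<s>", "the") is answered directly, while an unseen one like ("cat", "the") costs the backoff weight of "cat" plus the unigram probability of "the". The representation question both papers address is how to store and look up those tables for billions of n-grams without blowing up memory.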
BerkeleyLM was more of a one-off project. It is slower than KenLM and hasn't been touched in years. If you want to understand these techniques, your effort is probably best spent on the KenLM papers. It's also worth noting that Ken is a crack C++ programmer who has spent years hacking away at these problems, so your chances of finding further efficiencies there are probably quite limited unless you have a lot of background in the area. Even if you did, I would recommend you not spend your time that way: I basically consider the LM representation problem to have been solved by KenLM. That's not to say there aren't some improvements to be had on the Joshua / JNI bridge, but even there, there are probably better things to do.

matt

[0] KenLM: Faster and Smaller Language Model Queries
http://www.kheafield.com/professional/avenue/kenlm.pdf

[1] Faster and Smaller N-Gram Language Models
http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf

> On Oct 24, 2016, at 10:21 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
>
> Hi Folks,
> I have set out with the aim of learning more about the underlying Joshua
> language model serialization(s), e.g. statistical n-gram models in ARPA
> format [0], as well as trying to JProfile a running Joshua server to better
> understand how objects are used and what runtime memory usage looks like
> for typical translation tasks.
> This has led me to think about the fundamental performance issues we
> experience when loading large LMs into memory in the first place, and
> the efficiency of searching models regardless of whether they are cached in
> memory (e.g. Joshua server) or not.
> Does anyone have detailed technical/journal documentation which would set
> me in the right direction to address the above area?
> Thanks
> Lewis
>
> [0]
> http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats#statistical_n-gram_models_in_the_arpa_format
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney