updated LM specification
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/601d9f8b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/601d9f8b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/601d9f8b

Branch: refs/heads/master
Commit: 601d9f8ba40af8d8c75a46823d3ded717f379739
Parents: f74fc8c
Author: Matt Post <[email protected]>
Authored: Fri Jun 19 22:22:39 2015 -0400
Committer: Matt Post <[email protected]>
Committed: Fri Jun 19 22:22:39 2015 -0400

----------------------------------------------------------------------
 6.0/decoder.md | 45 +++++++++++++++++++++++++++------------------
 1 file changed, 27 insertions(+), 18 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/601d9f8b/6.0/decoder.md
----------------------------------------------------------------------
diff --git a/6.0/decoder.md b/6.0/decoder.md
index 149f5e6..e8dc8c9 100644
--- a/6.0/decoder.md
+++ b/6.0/decoder.md
@@ -212,24 +212,33 @@ For reference, the following two translation model lines are used by the [pipeli
 
 ### Language model options <a id="lm" />
 
-Joshua supports any number of language models. To add a language
-model, add a line of the following format to the configuration file:
-
-    lm = TYPE ORDER LEFT_STATE RIGHT_STATE CEILING_COST FILE
-
-where the six fields correspond to the following values:
-
-* `TYPE`: one of "kenlm", "berkeleylm", or "none"
-* `ORDER`: the order of the language model
-* `LEFT_STATE`: whether to use left-state minimization; currently only supported by KenLM
-* `RIGHT_STATE`: whether to use right equivalent state (currently unsupported)
-* `CEILING_COST`: the LM-specific ceiling cost of all n-grams (currently ignored)
-* `FILE`: the path to the language model file. All language model types support the standard ARPA
-  format. Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's compiled
-  format (using the program at `$JOSHUA/bin/build_binary`); if the the LM type is "berkeleylm", it
-  can be compiled by following the directions in
-  `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The [pipeline](pipeline.html) will
-  automatically compile either type.
+Joshua supports any number of language models. With Joshua 6.0, these
+are just regular feature functions:
+
+    feature-function = LanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
+    feature-function = StateMinimizingLanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
+
+`LanguageModel` is a generic language model, supporting types 'kenlm'
+(the default) and 'berkeleylm'. `StateMinimizingLanguageModel`
+implements LM state minimization to reduce the size of context n-grams
+where appropriate
+([Li and Khudanpur, 2008](http://www.aclweb.org/anthology/W08-0402.pdf);
+[Heafield et al., 2013](https://aclweb.org/anthology/N/N13/N13-1116.pdf)). This
+is currently only supported by KenLM, so the `-lm_type` option is not
+available here.
+
+The other key/value pairs are defined as follows:
+
+* `lm_type`: one of "kenlm" "berkeleylm"
+* `lm_order`: the order of the language model
+* `lm_file`: the path to the language model file. All language model
+  types support the standard ARPA format. Additionally, if the LM
+  type is "kenlm", this file can be compiled into KenLM's compiled
+  format (using the program at `$JOSHUA/bin/build_binary`); if the
+  the LM type is "berkeleylm", it can be compiled by following the
+  directions in
+  `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The
+  [pipeline](pipeline.html) will automatically compile either type.
 
 For each language model, you need to specify a feature weight in the following format:
