updated packing info
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/700e8581
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/700e8581
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/700e8581

Branch: refs/heads/asf-site
Commit: 700e85817db9f6dce152131b196f8b34195b2255
Parents: 4b3bdd3
Author: Matt Post <[email protected]>
Authored: Mon Jun 22 22:17:43 2015 -0400
Committer: Matt Post <[email protected]>
Committed: Mon Jun 22 22:17:43 2015 -0400

----------------------------------------------------------------------
 6.0/packing.md         | 91 +++++++++++++++++----------------------------
 _layouts/default6.html |  1 +
 2 files changed, 35 insertions(+), 57 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/700e8581/6.0/packing.md
----------------------------------------------------------------------
diff --git a/6.0/packing.md b/6.0/packing.md
index e911095..8189c66 100644
--- a/6.0/packing.md
+++ b/6.0/packing.md
@@ -4,73 +4,50 @@ category: advanced
 title: Grammar Packing
 ---
 
-Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
-efficiently encoding it for use by Joshua. Packing the grammar results in significantly faster load
-times for very large grammars.
+Grammar packing refers to the process of taking a textual grammar
+output by [Thrax](thrax.html) (or Moses, for phrase-based models) and
+efficiently encoding it so that it can be loaded
+[very quickly](https://aclweb.org/anthology/W/W12/W12-3134.pdf) ---
+packing the grammar results in significantly faster load times for
+very large grammars. Packing is done automatically by the
+[Joshua pipeline](pipeline.html), but you can also run the packer
+manually.
 
-Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
-automatically, and we will provide a script that automates these steps for you.
+The script can be found at
+`$JOSHUA/scripts/support/grammar-packer.pl`. See that script for
+example usage. You can then add it to a Joshua config file, simply
+replacing a `tm` path to the compressed text-file format with a path
+to the packed grammar directory (Joshua will automatically detect that
+it is packed.
 
-1. Make sure the grammar is labeled. A labeled grammar is one that has feature names attached to
-each of the feature values in each row of the grammar file. Here is a line from an unlabeled
-grammar:
+Packing the grammar requires first sorting it, which can take quite a
+bit of temporary space.
 
-        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184
+*CAVEAT*: You may run into problems packing very large hiero
+ grammars. Email the support list if you do.
 
-   and here is one from an labeled grammar (note that the labels are not very useful):
+### Examples
 
-        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184
+A Hiero grammar, using the compressed text file version:
 
-   If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:
-   
-        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py > grammar-labeled.gz
+    tm = hiero -owner pt -maxspan 20 -path grammar.filtered.gz
+
+Pack it:
 
-   As a side-effect of this step is to produce a file 'dense_map' in the current directory,
-   containing the mapping between feature names and feature columns. This file is needed in later
-   steps.
+    $JOSHUA/scripts/support/grammar-packer.pl grammar.filtered.gz grammar.packed
 
-1. The packer needs a sorted grammar. It is sufficient to sort by the first word:
+Pack a really big grammar:
 
-        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz
-   
-   (The reason we need a sorted grammar is because the packer stores the grammar in a trie. The
-   pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
-   grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
-   lookup).
-
-1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
-   and (2) a dense map file.
-
-   1. Write a packer config file. This file specifies items such as the chunk size (for the packed
-      pieces) and the quantization classes and types for each feature name. Examples can be found
-      at
-
-            $JOSHUA/test/packed/packer.config
-            $JOSHUA/test/bn-en/packed/packer.quantized
-            $JOSHUA/test/bn-en/packed/packer.uncompressed
-
-      The quantizer lines in the packer config file have the following format:
-
-            quantizer TYPE FEATURES
-
-      where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a
-      space-delimited list of feature names that have that quantization type.
-
-   1. Write a dense_map file. If you labeled an unlabeled grammar, this was produced for you as a
-      side product of the `label_grammar.py` script you called in Step 1. Otherwise, you need to
-      create a file that lists the mapping between feature names and (0-indexed) columns in the
-      grammar, one per line, in the following format:
-
-            feature-index feature-name
-
-1. To pack the grammar, type the following command:
+    $JOSHUA/scripts/support/grammar-packer.pl -m 30g grammar.filtered.gz grammar.packed
+
+Be a little more verbose:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -m 30g grammar.filtered.gz grammar.packed
 
-        java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE
+You have a different temp file location:
 
-   This will read in your packer configuration file and your grammar, and produced a packed grammar
-   in the output directory.
+    $JOSHUA/scripts/support/grammar-packer.pl -T /local grammar.filtered.gz grammar.packed
 
-1. To use the packed grammar, just point to the packed directory in your Joshua configuration file.
+Update the config file line:
 
-        tm-file = packed-grammar/
-        tm-format = packed
+    tm = hiero -owner pt -maxspan 20 -path grammar.packed

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/700e8581/_layouts/default6.html
----------------------------------------------------------------------
diff --git a/_layouts/default6.html b/_layouts/default6.html
index 3d19a7b..3737c63 100644
--- a/_layouts/default6.html
+++ b/_layouts/default6.html
@@ -86,6 +86,7 @@
           <li><a href="/6.0/bundle.html">Building language packs</a></li>
           <li><a href="/6.0/decoder.html">Decoder options</a></li>
           <li><a href="/6.0/file-formats.html">File formats</a></li>
+          <li><a href="/6.0/packing.html">Packing TMs</a></li>
         </ol>
       </div>
 
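
Note on the "Update the config file line" step in the new packing.md: the change amounts to swapping the value after `-path` on the `tm` line. A minimal Python sketch of that edit follows. The helper name `point_tm_at_packed` is hypothetical and is not part of Joshua; the sketch assumes the `tm = ... -path ...` line layout shown in the diff and does nothing else to the config.

```python
import re


def point_tm_at_packed(config_line: str, packed_dir: str) -> str:
    """Return config_line with its -path value replaced by packed_dir.

    Hypothetical helper, not part of Joshua. Lines that are not `tm = ...`
    entries are returned unchanged.
    """
    if not re.match(r"\s*tm\s*=", config_line):
        return config_line
    # A function replacement avoids backslash-escape surprises in packed_dir.
    return re.sub(
        r"(-path\s+)\S+",
        lambda m: m.group(1) + packed_dir,
        config_line,
        count=1,
    )


old_line = "tm = hiero -owner pt -maxspan 20 -path grammar.filtered.gz"
new_line = point_tm_at_packed(old_line, "grammar.packed")
untouched = point_tm_at_packed("lm = kenlm 5 false false 100 lm.gz", "grammar.packed")
print(new_line)
```

Only the path changes; per the new text, Joshua detects on its own that the target is a packed grammar, so no format flag is added.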

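
For readers comparing against the removed instructions: the old "labeled grammar" step attached a name to each feature value in the last `|||` field. The sketch below reproduces the `f1=... f2=...` naming seen in the removed example line. It is an illustration only, not the actual `label_grammar.py`, and the function name `label_rule` and the placeholder source word `src` are assumptions.

```python
def label_rule(line: str) -> str:
    """Attach f1, f2, ... labels to the feature values of one grammar rule.

    Illustrative only: this mirrors the labeled example in the removed text
    and is not the real label_grammar.py.
    """
    fields = line.rstrip("\n").split(" ||| ")
    values = fields[-1].split()
    # Features are named by 1-based position, as in the removed example.
    fields[-1] = " ".join(f"f{i}={v}" for i, v in enumerate(values, start=1))
    return " ||| ".join(fields)


unlabeled = "[X] ||| [X,1] src [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184"
labeled = label_rule(unlabeled)
print(labeled)
```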