Initial import of joshua-decoder.github.com site to Apache
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/ccc92816 Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/ccc92816 Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/ccc92816 Branch: refs/heads/asf-site Commit: ccc928165df2cd288b9fd7152f56a9be6cd3fc33 Parents: Author: Lewis John McGibbney <[email protected]> Authored: Mon Apr 4 22:16:48 2016 -0700 Committer: Lewis John McGibbney <[email protected]> Committed: Mon Apr 4 22:16:48 2016 -0700 ---------------------------------------------------------------------- 4.0/decoder.md | 910 +++ 4.0/faq.md | 7 + 4.0/features.md | 7 + 4.0/file-formats.md | 78 + 4.0/index.md | 48 + 4.0/large-lms.md | 192 + 4.0/lattice.md | 17 + 4.0/packing.md | 76 + 4.0/pipeline.md | 576 ++ 4.0/step-by-step-instructions.html | 908 +++ 4.0/thrax.md | 14 + 4.0/tms.md | 106 + 4.0/zmert.md | 83 + 5.0/advanced.md | 7 + 5.0/bundle.md | 24 + 5.0/decoder.md | 374 ++ 5.0/faq.md | 7 + 5.0/features.md | 6 + 5.0/file-formats.md | 72 + 5.0/index.md | 77 + 5.0/jacana.md | 139 + 5.0/large-lms.md | 192 + 5.0/packing.md | 76 + 5.0/pipeline.md | 640 ++ 5.0/server.md | 30 + 5.0/thrax.md | 14 + 5.0/tms.md | 106 + 5.0/tutorial.md | 174 + 5.0/zmert.md | 83 + 6.0/advanced.md | 7 + 6.0/bundle.md | 100 + 6.0/decoder.md | 385 ++ 6.0/faq.md | 161 + 6.0/features.md | 6 + 6.0/file-formats.md | 72 + 6.0/index.md | 24 + 6.0/install.md | 88 + 6.0/jacana.md | 139 + 6.0/large-lms.md | 192 + 6.0/packing.md | 74 + 6.0/pipeline.md | 666 ++ 6.0/quick-start.md | 59 + 6.0/server.md | 30 + 6.0/thrax.md | 14 + 6.0/tms.md | 106 + 6.0/tutorial.md | 187 + 6.0/whats-new.md | 12 + 6.0/zmert.md | 83 + 6/advanced.md | 7 + 6/bundle.md | 100 + 6/decoder.md | 385 ++ 6/faq.md | 161 + 6/features.md | 6 + 6/file-formats.md | 72 + 6/index.md | 24 + 6/install.md | 88 + 6/jacana.md | 139 + 6/large-lms.md | 192 + 6/packing.md | 74 + 6/pipeline.md | 666 ++ 6/quick-start.md | 59 + 6/server.md | 30 + 6/thrax.md | 14 + 6/tms.md | 106 + 6/tutorial.md | 187 + 6/whats-new.md | 12 + 6/zmert.md | 83 + CNAME | 1 + README.md | 42 + _config.yml | 5 + _data/joshua.yaml | 2 + _layouts/default.html | 169 + _layouts/default4.html | 94 + _layouts/default6.html | 200 + _layouts/documentation.html | 60 + blog.css | 171 + bootstrap/css/bootstrap-responsive.css | 1109 +++ bootstrap/css/bootstrap-responsive.min.css | 9 + bootstrap/css/bootstrap.css | 6167 +++++++++++++++++ bootstrap/css/bootstrap.min.css | 9 + bootstrap/img/glyphicons-halflings-white.png | Bin 0 -> 8777 bytes bootstrap/img/glyphicons-halflings.png | Bin 0 -> 12799 bytes bootstrap/js/bootstrap.js | 2280 +++++++ bootstrap/js/bootstrap.min.js | 6 + contributors.md | 44 + data/fisher-callhome-corpus/images/lattice.png | Bin 0 -> 22684 bytes data/fisher-callhome-corpus/index.html | 94 + data/index.html | 7 + data/indian-parallel-corpora/images/map1.png | Bin 0 -> 59635 bytes data/indian-parallel-corpora/images/map2.png | Bin 0 -> 51311 bytes data/indian-parallel-corpora/index.html | 111 + devel/index.html | 16 + dist/css/bootstrap-theme.css | 470 ++ dist/css/bootstrap-theme.css.map | 1 + dist/css/bootstrap-theme.min.css | 5 + dist/css/bootstrap.css | 6332 ++++++++++++++++++ dist/css/bootstrap.css.map | 1 + dist/css/bootstrap.min.css | 5 + dist/fonts/glyphicons-halflings-regular.eot | Bin 0 -> 20335 bytes dist/fonts/glyphicons-halflings-regular.svg | 229 + dist/fonts/glyphicons-halflings-regular.ttf | Bin 0 -> 41280 bytes 
dist/fonts/glyphicons-halflings-regular.woff | Bin 0 -> 23320 bytes dist/js/bootstrap.js | 2320 +++++++ dist/js/bootstrap.min.js | 7 + dist/js/npm.js | 13 + fisher-callhome-corpus/index.html | 1 + images/desert.jpg | Bin 0 -> 121958 bytes images/joshua-logo-small.png | Bin 0 -> 29235 bytes images/joshua-logo.jpg | Bin 0 -> 236977 bytes images/joshua-logo.pdf | Bin 0 -> 1465851 bytes images/joshua-logo.png | Bin 0 -> 858713 bytes images/logo-credits.txt | 1 + images/sponsors/NSF-logo.jpg | Bin 0 -> 38008 bytes images/sponsors/darpa-logo.jpg | Bin 0 -> 11552 bytes images/sponsors/euromatrix.png | Bin 0 -> 59093 bytes images/sponsors/hltcoe-logo1.jpg | Bin 0 -> 8278 bytes images/sponsors/hltcoe-logo1.png | Bin 0 -> 22031 bytes images/sponsors/hltcoe-logo2.jpg | Bin 0 -> 8803 bytes images/sponsors/hltcoe-logo2.png | Bin 0 -> 9767 bytes images/sponsors/hltcoe-logo3.png | Bin 0 -> 34899 bytes index.md | 43 + index5.html | 237 + indian-parallel-corpora/index.html | 1 + joshua.bib | 12 + joshua.css | 44 + joshua4.css | 184 + joshua6.css | 220 + language-packs.csv | 2 + language-packs/ar-en-phrase/index.html | 16 + language-packs/es-en-phrase/index.html | 16 + language-packs/index.md | 69 + language-packs/paraphrase/index.md | 8 + language-packs/zh-en-hiero/index.html | 16 + publications/joshua-2.0.pdf | Bin 0 -> 95757 bytes publications/joshua-3.0.pdf | Bin 0 -> 198854 bytes ...lkit-for-statistical-machine-translation.pdf | Bin 0 -> 105762 bytes releases.md | 61 + releases/5.0/index.html | 16 + releases/6.0/index.html | 4 + releases/current/index.html | 4 + releases/index.md | 11 + releases/runtime/index.html | 4 + style.css | 237 + support/index.md | 25 + 144 files changed, 31064 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/decoder.md ---------------------------------------------------------------------- diff --git a/4.0/decoder.md b/4.0/decoder.md new file mode 100644 index 0000000..e3839bf --- /dev/null +++ b/4.0/decoder.md @@ -0,0 +1,910 @@ +--- +layout: default4 +category: links +title: Decoder configuration parameters +--- + +Joshua configuration parameters affect the runtime behavior of the decoder itself. This page +describes the complete list of these parameters and describes how to invoke the decoder manually. + +To run the decoder, a convenience script is provided that loads the necessary Java libraries. +Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation, +its syntax is: + + $JOSHUA/joshua-decoder [-m memory-amount] [-c config-file other-joshua-options ...] + +The `-m` argument, if present, must come first, and the memory specification is in Java format +(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and +"gigabytes", and there cannot be a space between the number and the unit. The value of this +argument is passed to Java itself in the invocation of the decoder, and the remaining options are +passed to Joshua. The `-c` parameter has special import because it specifies the location of the +configuration file. + +The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are +received, according to a number of [output options](#output). If no run-time parameters are +specified (e.g., no translation model), sentences are simply pushed through untranslated. 
Blank +lines are similarly pushed through as blank lines, so as to maintain parallelism with the input. + +Parameters can be provided to Joshua via a configuration file and from the command +line. Command-line arguments override values found in the configuration file. The format for +configuration file parameters is + + parameter = value + +Command-line options are specified in the following format + + -parameter value + +Values are one of four types (which we list here mostly to call attention to the boolean format): + +- STRING, an arbitrary string (no spaces) +- FLOAT, a floating-point value +- INT, an integer +- BOOLEAN, a boolean value. For booleans, `true` evaluates to true, and all other values evaluate + to false. For command-line options, the value may be omitted, in which case it evaluates to + true. For example, the following are equivalent: + + $JOSHUA/joshua-decoder -show-align-index true + $JOSHUA/joshua-decoder -show-align-index + +## Joshua configuration file + +Before describing the list of Joshua parameters, we present a note about the configuration file. +In addition to the decoder parameters described below, the configuration file contains the feature +weight values for the model. The weight values are distinguished from runtime parameters in two +ways: (1) they cannot be overridden on the command line, and (2) they do not have an equals sign +(=). Parameters are described in further detail in the [feature file](features.html). They take +the following format, and by convention are placed at the end of the configuration file: + + lm 0 4.23 + phrasement pt 0 -0.2 + oovpenalty -100 + +## Joshua decoder parameters + +This section contains a list of the Joshua run-time parameters. An important note about the +parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (-) are +removed and case is converted to lowercase. For example, the following parameter forms are +equivalent (either in the configuration file or from the command line): + + {top-n, topN, top_n, TOP_N, t-o-p-N} + {poplimit, pop-limit, pop-limit, popLimit} + +This basically defines equivalence classes of parameters, and relieves you of the task of having to +remember the exact format of each parameter. + +In what follows, we group the configuration parameters in the following groups: + +- [Alternate modes of operation](#modes) +- [General options](#general) +- [Pruning](#pruning) +- [Translation model options](#tm) +- [Language model options](#lm) +- [Output options](#output) + +<a name="modes" /> + +### Alternate modes of operation + +In addition to decoding (which is the default mode), Joshua can also produce synchronous parses of a +(source,target) pair of sentences. This mode disables the language model (since no generation is +required) but still requires a translation model. To enable it, you must do two things: + +1. Set the configuration parameters `parse = true`. +2. Provide input in the following format: + + source sentence ||| target sentence + +You may also wish to display the synchronouse parse tree (`-use-tree-nbest`) and the alignment +(`-show-align-index`). + +The synchronous parsing implementation is that of Dyer (2010) +[PDF](http://www.aclweb.org/anthology/N/N10/N10-1033). + +If parsing is enabled, the following features become relevant. If you would like more information +about how to use these features, please ask [Jonny Weese](http://cs.jhu.edu/~jonny/) to document +them. 
+ +- `forest-pruning` --- *false* + + If true, the synchronous forest will be pruned. + +- `forest-pruning-threshold` --- *10* + + The threshold used for pruning. + +- `use-kbest-hg` --- *false* + + The k-best hypergraph to use. + + +<a name="general" /> + +### General decoder options + +- `c`, `config` --- *NULL* + + Specifies the configuration file from which Joshua options are loaded. This feature is unique in + that it must be specified from the command line. + +- `oracle-file` --- *NULL* + + The location of a set of oracle reference translations, parallel to the input. When present, + after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the + translation forest with a BLEU approximation in order to extract the oracle-translation from the + forest. This is useful for obtaining an (approximation to an) upper bound on your translation + model under particular search settings. + +- `default-nonterminal` --- *"X"* + + This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. + +- `goal-symbol` --- *"GOAL"* + + This is the symbol whose presence in the chart over the whole input span denotes a successful + parse (translation). It should match the LHS nonterminal in your glue grammar. Internally, + Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can + optionally supply in the configuration file. + +- `true-oovs-only` --- *false* + + By default, Joshua creates an OOV entry for every word in the source sentence, regardless of + whether it is found in the grammar. This allows every word to be pushed through untranslated + (although potentially incurring a high cost based on the `oovPenalty` feature). If this option is + set, then only true OOVs are entered into the chart as OOVs. + +- `use-sent-specific-tm` --- *false* + + If set to true, Joshua will look for sentence-specific filtered grammars. The location is + determined by taking the supplied translation model (`tm-file`) and looking for a `filtered/` + subdirectory for a file with the same name but with the (0-indexed) sentence number appended to + it. For example, if + + tm-file = /path/to/grammar.gz + + then the sentence-filtered grammars should be found at + + /path/to/filtered/grammar.0.gz + /path/to/filtered/grammar.1.gz + /path/to/filtered/grammar.2.gz + ... + +- `threads`, `num-parallel-decoders` --- *1* + + This determines how many simultaneous decoding threads to launch. + + Outputs are assembled in order and Joshua has to hold on to the complete target hypergraph until + it is ready to be processed for output, so too many simultaneous threads could result in lots of + memory usage if a long sentence results in many sentences being queued up. We have run Joshua + with as many as 48 threads without any problems of this kind, but it's useful to keep in the back + of your mind. + +- `oov-feature-cost` --- *100* + + Each OOV word incurs this cost, which is multiplied against the `oovPenalty` feature (which is + tuned but can be held fixed). + +- `use-google-linear-corpus-gain` +- `google-bleu-weights` + + +<a name="pruning" /> + +### Pruning options + +There are three different approaches to pruning in Joshua. + +1. No pruning. Exhaustive decoding is triggered by setting `pop-limit = 0` and +`use-beam-and-threshold-prune = false`. + +1. The old approach. This approach uses a handful of pruning parameters whose specific roles are +hard to understand and whose interaction is even more difficult to quantify. 
It is triggered by +setting `pop-limit = 0` and `use-beam-and-threshold-prune = true`. + +1. Pop-limit pruning (the new approach). The pop limit determines the number of hypotheses that are + popped from the candidates list for each of the O(n^2) spans of the input. A nice feature of this + approach is that it provides a single value to control the size of the search space that is + explored (and therefore runtime). + +Selecting among these pruning methods could be made easier via a single parameter with enumerated +values, but currently, we are stuck with this slightly more cumbersome way. The defaults ensure +that you don't have to worry about them too much. Pop-limit pruning is enabled by default, and it +is the recommended approach; if you want to control the speed / accuracy tradeoff, you should change +the pop limit. + +- `pop-limit` --- *100* + + The number of hypotheses to examine for each span of the input. Higher values result in a larger + portion of the search space being explored at the cost of an increased search time. + +- `use-beam-and-threshold-pruning` --- *false* + + Enables the use of beam-and-threshold pruning, and makes the following five features relevant. + + - `fuzz1` --- *0.1* + - `fuzz2` --- *0.2* + - `max-n-items` --- *30* + - `relative-threshold` --- *10.0* + - `max-n-rules` --- *50* + +- `constrain-parse` --- *false* +- `use_pos_labels` --- *false* + + +<a name="tm" /> + +### Translation model options + +At the moment, Joshua supports only two translation models, which are designated as the (main) +translation model and the glue grammar. Internally, these grammars are distinguished only in that +the `span-limit` parameter applies only to the glue grammar. In the near future we plan to +generalize the grammar specification to permit an unlimited number of translation models. + +The main translation grammar is specified with the following set of parameters: + +- `tm_file STRING` --- *NULL*, `glue_file STRING` --- *NULL* + + This points to the file location of the translation grammar for text-based formats or to the + directory for the [packed representation](packing.html). + +- `tm_format STRING` --- *thrax*, `glue_format STRING` --- *thrax* + + The format the file is in. The permissible formats are `hiero` or `thrax` (which are equivalent), + `packed` (for [packed grammars](packing.html)), or `samt` (for grammars encoded in the format + defined by [Zollmann & Venugopal](http://www.cs.cmu.edu/~zollmann/samt/). This parameter will be + done away with in the near future since it is easily inferrable. See + [the formats page](file-formats.html) for more information about file formats. + +- `phrase_owner STRING` --- *pt*, `glue-owner STRING` --- *pt* + + The ownership concept is used to distinguish the set of feature weights that apply to each + grammar. See the [page on features](features.html) for more information. By default, these + parameters have the same value, meaning the grammars share a set of features. + +- `span-limit` --- *10* + + This controls the maximum span of the input that grammar rules loaded from `tm-file` are allowed + to apply. The span limit is ignored for glue grammars. + +<a name="lm" /> + +### Language model options + +Joshua supports the incorporation of an arbitrary number of language models. 
To add a language +model, add a line of the following format to the configuration file: + + lm = lm-type order 0 0 lm-ceiling-cost lm-file + +where the six fields correspond to the following values: + +* *lm-type*: one of "kenlm", "berkeleylm", "javalm" (not recommended), or "none" +* *order*: the N of the N-gram language model +* *0*: whether to use left equivalent state (currently not supported) +* *0*: whether to use right equivalent state (currently not supported) +* *lm-ceiling-cost*: the LM-specific ceiling cost of any n-gram (currently ignored; + `lm-ceiling-cost` applies to all language models) +* *lm-file*: the path to the language model file. All types support the standard ARPA format. + Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's compiled format + (using the program at `$JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary`), and if the LM type + is "berkeleylm", it can be compiled by following the directions in + `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. + +For each language model, you need to specify a feature weight in the following format: + + lm 0 WEIGHT + lm 1 WEIGHT + ... + +where the indices correspond to the language model declaration lines in order. + +For backwards compatibility, Joshua also supports a separate means of specifying the language model, +by separately specifying each of `lm-file` (NULL), `lm-type` (kenlm), `order` (5), and +`lm-ceiling-cost` (100). + + +<a name="output" /> + +### Output options + +The output for a given input is a set of one or more lines with the following scheme: + + input ID ||| translation ||| model scores ||| score + +These parameters largely determine what is output by Joshua. + +- `top-n` --- *300* + + The number of translation hypotheses to output, sorted in non-increasing order of model score (i.e., + highest first). + +- `use-unique-nbest` --- *true* + + When constructing the n-best list for a sentence, skip hypotheses whose string has already been + output. This increases the amount of diversity in the n-best list by removing spurious ambiguity + in the derivation structures. + +- `add-combined-cost` --- *true* + + In addition to outputting the hypothesis number, the translation, and the individual feature + weights, output the combined model cost. + +- `use-tree-nbest` --- *false* + + Output the synchronous derivation tree in addition to the output string, for each candidate in the + n-best list. + +- `escape-trees` --- *false* + + +- `include-align-index` --- *false* + + Output the source words indices that each target word aligns to. + +- `mark-oovs` --- *false* + + if `true`, this causes the text "_OOV" to be appended to each OOV in the output. + +- `visualize-hypergraph` --- *false* + + If set to true, a visualization of the hypergraph will be displayed, though you will have to + explicitly include the relevant jar files. See the example usage in + `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence, + translation, and synchronous derivation. + +- `save-disk-hg` --- *false* [DISABLED] + + This feature directs that the hypergraph should be written to disk. The code is in + + $JOSHUA/src/joshua/src/DecoderThread.java + + but the feature has not been tested in some time, and is thus disabled. It probably wouldn't take + much work to fix it! If you do, you might find the + [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format) + on the ACL Wiki to be useful. 
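
To tie the parameters above together, here is a minimal sketch of what a Joshua configuration file might look like; the file paths, the LM order, and the weight values are hypothetical placeholders rather than values shipped with Joshua:

    # decoder options (parameter = value)
    lm = kenlm 5 0 0 100 /path/to/lm.kenlm
    tm_file = /path/to/grammar.gz
    tm_format = thrax
    glue_file = /path/to/glue-grammar.gz
    glue_format = thrax
    top-n = 300
    pop-limit = 100

    # feature weights (note: no equals sign)
    lm 0 4.23
    oovpenalty -100

Any of the runtime parameters (but not the weights) can also be overridden from the command line, e.g., `$JOSHUA/joshua-decoder -c joshua.config -top-n 10`.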
+ +<!-- + +## Full list of command-line options and arguments + +<table border="0"> + <tr> + <th> + option + </th> + <th> + value + </th> + <th> + description + </th> + </tr> + + <tr> + <td> + <code>-lm</code> + </td> + <td> + String, e.g. <n /> <code>TYPE 5 false false 100 FILE</code> + </td> + <td markdown="1"> + Use once for each of one or language models. + </td> + </tr> + + <tr> + <td> + <code>-lm_file</code> + </td> + <td> + String: path the the language model file + </td> + <td> + ??? + </td> + </tr> + + <tr> + <td> + <code>-parse</code> + </td> + <td> + None + </td> + <td> + whether to parse (if not then decode) + </td> + </tr> + + <tr> + <td> + <code>-tm_file</code> + </td> + <td> + String + </td> + <td> + path to the the translation model + </td> + </tr> + + <tr> + <td> + <code>-glue_file</code> + </td> + <td> + String + </td> + <td> + ??? + </td> + </tr> + + <tr> + <td> + <code>-tm_format</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>-glue_format</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>-lm_type</code> + </td> + <td> + value + </td> + <td> + description + </td> + </tr> + <tr> + <td> + <code>lm_ceiling_cost</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_left_equivalent_state</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_right_equivalent_state</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>order</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_sent_specific_lm</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>span_limit</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>phrase_owner</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>glue_owner</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>default_non_terminal</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>goalSymbol</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>constrain_parse</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>oov_feature_index</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>true_oovs_only</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_pos_labels</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>fuzz1</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>fuzz2</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>max_n_items</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>relative_threshold</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>max_n_rules</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_unique_nbest</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + 
<code>add_combined_cost</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_tree_nbest</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>escape_trees</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>include_align_index</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>top_n</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>parallel_files_prefix</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>num_parallel_decoders</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>threads</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>save_disk_hg</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_kbest_hg</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>forest_pruning</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>forest_pruning_threshold</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>visualize_hypergraph</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>mark_oovs</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>pop-limit</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>useCubePrune</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> +</table> +--> + http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/faq.md ---------------------------------------------------------------------- diff --git a/4.0/faq.md b/4.0/faq.md new file mode 100644 index 0000000..f0a4151 --- /dev/null +++ b/4.0/faq.md @@ -0,0 +1,7 @@ +--- +layout: default4 +category: help +title: Common problems +--- + +Solutions to common problems will be posted here as we become aware of them. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/features.md ---------------------------------------------------------------------- diff --git a/4.0/features.md b/4.0/features.md new file mode 100644 index 0000000..d915c82 --- /dev/null +++ b/4.0/features.md @@ -0,0 +1,7 @@ +--- +layout: default4 +category: links +title: Features +--- + +This file will contain information about the Joshua decoder features. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/file-formats.md ---------------------------------------------------------------------- diff --git a/4.0/file-formats.md b/4.0/file-formats.md new file mode 100644 index 0000000..c10f906 --- /dev/null +++ b/4.0/file-formats.md @@ -0,0 +1,78 @@ +--- +layout: default4 +category: advanced +title: Joshua file formats +--- +This page describes the formats of Joshua configuration and support files. + +## Translation models (grammars) + +Joshua supports three grammar file formats. + +1. Thrax / Hiero +1. SAMT [deprecated] +1. packed + +The *Hiero* format is not restricted to Hiero grammars, but simply means *the format that David +Chiang developed for Hiero*. It can support a much broader class of SCFGs containing an arbitrary +set of nonterminals. 
Similarly, the *SAMT* format is not restricted to SAMT grammars but instead +simply denotes *the grammar format that Zollmann and Venugopal developed for their decoder*. To +remove this source of confusion, "thrax" is the preferred format designation, and is in fact the +default. + +The packed grammar format is the efficient grammar representation developed by +[Juri Ganitkevich](http://cs.jhu.edu/~juri) [is described in detail elsewhere](packing.html). + +Grammar rules in the Thrax format follow this format: + + [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES + +Here are some two examples, one for a Hiero grammar, and the other for an SAMT grammar: + + [X] ||| el chico [X] ||| the boy [X] ||| -3.14 0 2 17 + [S] ||| el chico [VP] ||| the boy [VP] ||| -3.14 0 2 17 + +The feature values can have optional labels, e.g.: + + [X] ||| el chico [X] ||| the boy [X] ||| lexprob=-3.14 abstract=0 numwords=2 count=17 + +These feature names are made up. For an actual list of feature names, please +[see the Thrax documentation](thrax.html). + +The SAMT grammar format is deprecated and undocumented. + +## Language Model + +Joshua has three language model implementations: [KenLM](), [BerkeleyLM](), and an (unrecommended) +dummy Java implementation. All language model implementations support the standard ARPA format +output by [SRILM](). In addition, KenLM and BerkeleyLM support compiled formats that can be loaded +more quickly and efficiently. + +### Compiling for KenLM + +To compile an ARPA grammar for KenLM, use the (provided) `build-binary` command, located deep within +the Joshua source code: + + $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary lm.arpa lm.kenlm + +This script takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`. + +### Compiling for BerkeleyLM + +To compile a grammar for BerkeleyLM, type: + + java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm + +The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html). + +## Joshua configuration + +See [the decoder page](decoder.html). + +## Pipeline configuration + +See [the pipeline page](pipeline.html). + +## Thrax configuration + +See [the thrax page](thrax.html). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/index.md ---------------------------------------------------------------------- diff --git a/4.0/index.md b/4.0/index.md new file mode 100644 index 0000000..ae62e4e --- /dev/null +++ b/4.0/index.md @@ -0,0 +1,48 @@ +--- +layout: default4 +title: Joshua 4.0 User Documentation +--- + +This page contains end-user oriented documentation for the 4.0 release of +[the Joshua decoder](http://joshua-decoder.org/). + +## Download and Setup + +1. Follow [this link](http://cs.jhu.edu/~post/files/joshua-4.0.tgz) to download Joshua, or do it +from the command line: + + wget -q http://cs.jhu.edu/~post/files/joshua-4.0.tgz + +2. Next, unpack it, set the `$JOSHUA` environment variable, and compile everything: + + tar xzf joshua-4.0.tgz + cd joshua-4.0 + + # for bash + export JOSHUA=$(pwd) + echo "export JOSHUA=$JOSHUA" >> ~/.bashrc + + # for tcsh + setenv JOSHUA `pwd` + echo "setenv JOSHUA $JOSHUA" >> ~/.profile + + ant all + +3. That's it. 
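
As a quick sanity check that the build succeeded (an optional step, not part of the original instructions), you can pipe a line of text through the decoder with no configuration at all; as noted on [the decoder page](decoder.html), input is simply pushed through untranslated when no translation model is specified:

    echo "hello" | $JOSHUA/joshua-decoder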
+
+## Quick start
+
+If you just want to run the complete machine translation pipeline (beginning with data preparation,
+through alignment, hierarchical model building, tuning, testing, and reporting), we recommend you
+use our <a href="pipeline.html">pipeline script</a>. You might also be interested in
+[Chris' old walkthrough](http://cs.jhu.edu/~ccb/joshua/).
+
+## More information
+
+For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html). You can also learn more about other steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html) (new with version 4.0).
+
+If you have problems or issues, you might find some help [on our answers page](faq.html) or
+[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/large-lms.md
----------------------------------------------------------------------
diff --git a/4.0/large-lms.md b/4.0/large-lms.md
new file mode 100644
index 0000000..a4ba5b7
--- /dev/null
+++ b/4.0/large-lms.md
@@ -0,0 +1,192 @@
+---
+layout: default4
+title: Building large LMs with SRILM
+category: advanced
+---
+
+The following is a tutorial for building a large language model from the
+English Gigaword Fifth Edition corpus
+[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
+using SRILM. English text is provided from seven different sources.
+
+### Step 0: Clean up the corpus
+
+The Gigaword corpus has to be stripped of all SGML tags and tokenized.
+Instructions for performing those steps are not included in this
+documentation. A description of this process can be found in a paper
+called ["Annotated
+Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).
+
+The Joshua package ships with a script that converts all alphabetical
+characters to their lowercase equivalent. The script is located at
+`$JOSHUA/scripts/lowercase.perl`.
+
+Make a directory structure as follows:
+
+    gigaword/
+    ├── corpus/
+    │   ├── afp_eng/
+    │   │   ├── afp_eng_199405.lc.gz
+    │   │   ├── afp_eng_199406.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── apw_eng/
+    │   │   ├── apw_eng_199411.lc.gz
+    │   │   ├── apw_eng_199412.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── cna_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── ltw_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── nyt_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── wpb_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   └── xin_eng/
+    │       ├── ...
+    │       └── counts/
+    └── lm/
+        ├── afp_eng/
+        ├── apw_eng/
+        ├── cna_eng/
+        ├── ltw_eng/
+        ├── nyt_eng/
+        ├── wpb_eng/
+        └── xin_eng/
+
+The next step will be to build smaller LMs and then interpolate them into one
+file.
+
+### Step 1: Count ngrams
+
+Run the following script once from each source directory under the `corpus/`
+directory (edit it to specify the path to the `ngram-count` binary as well as
+the number of processors):
+
+    #!/bin/sh
+
+    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
+    args=""
+
+    for source in *.gz; do
+       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
+    done
+
+    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT
+
+Then move each `counts/` directory to the corresponding directory under
+`lm/`. Now that each ngram has been counted, we can make a language
+model for each of the seven sources.
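
One way to carry out the move just described, assuming the directory layout shown in Step 0 and starting from the top-level `gigaword/` directory, is a short loop like the following (a sketch, not part of the original instructions):

    # move each source's counts/ directory from corpus/ to lm/
    for src in afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng; do
        mv corpus/$src/counts lm/$src/
    done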
+
+### Step 2: Make individual language models
+
+SRILM includes a script, called `make-big-lm`, for building large language
+models under resource-limited environments. The manual for this script can be
+read online
+[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
+Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
+even in environments with many parallel processors and a lot of memory.
+
+Initiate the following script from each of the source directories under the
+`lm/` directory (edit it to specify the path to the `make-big-lm` script as
+well as the pruning threshold):
+
+    #!/bin/bash
+    set -x
+
+    CMD=$SRILM_SRC/bin/make-big-lm
+    PRUNE_THRESHOLD=1e-8
+
+    $CMD \
+      -name gigalm `for k in counts/*.gz; do echo " \
+      -read $k "; done` \
+      -lm lm.gz \
+      -max-per-file 100000000 \
+      -order 5 \
+      -kndiscount \
+      -interpolate \
+      -unk \
+      -prune $PRUNE_THRESHOLD
+
+The language model attributes chosen are the following:
+
+* N-grams up to order 5
+* Kneser-Ney smoothing
+* N-gram probability estimates at the specified order *n* are interpolated with
+  lower-order estimates
+* include the unknown-word token as a regular word
+* pruning N-grams based on the specified threshold
+
+Next, we will mix the models together into a single file.
+
+### Step 3: Mix models together
+
+Using development text, interpolation weights can be determined that give the
+highest weight to the source language models that have the lowest perplexity on
+the specified development set.
+
+#### Step 3-1: Determine interpolation weights
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the path to the development text file):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
+
+    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+
+    for d in ${dirs[@]} ; do
+      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
+    done
+
+    compute-best-mix */lm.ppl > best-mix.ppl
+
+Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
+values in parentheses. These are the interpolation weights of the source
+language models in the order specified. Copy and paste the values within the
+parentheses into the script below.
+
+#### Step 3-2: Combine the models
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the interpolation weights):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DIRS=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)
+
+    $NGRAM -order 5 -unk \
+      -lm ${DIRS[0]}/lm.gz -lambda ${LAMBDAS[0]} \
+      -mix-lm ${DIRS[1]}/lm.gz \
+      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
+      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
+      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
+      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
+      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
+      -write-lm mixed_lm.gz
+
+The resulting file, `mixed_lm.gz`, is a language model based on all the text in
+the Gigaword corpus, with some probabilities biased toward the development text
+specified in Step 3-1. It is in the ARPA format. The optional next step converts
+it into KenLM format.
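
As an optional sanity check (not part of the original tutorial), you can compute the mixed model's perplexity on the development text, reusing the `ngram` binary and the `DEV_TEXT` path from the Step 3-1 script and assuming you are still in the `lm/` directory:

    # optional: perplexity of the mixed model on the Step 3-1 development text
    $SRILM_SRC/bin/i686-m64/ngram -order 5 -unk -lm mixed_lm.gz -ppl $DEV_TEXT

Since the mixture weights were optimized on this text, the reported perplexity should typically be lower than that of any individual source model.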
+
+#### Step 3-3: Convert to KenLM
+
+The KenLM format has some speed advantages over the ARPA format. Issuing the
+following command will write a new language model file `mixed_lm.kenlm` that
+is the `mixed_lm.gz` language model transformed into the KenLM format.
+
+    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/lattice.md
----------------------------------------------------------------------
diff --git a/4.0/lattice.md b/4.0/lattice.md
new file mode 100644
index 0000000..5d6bd47
--- /dev/null
+++ b/4.0/lattice.md
@@ -0,0 +1,17 @@
+---
+layout: default4
+category: advanced
+title: Lattice decoding
+---
+
+In addition to regular sentences, Joshua can decode weighted lattices encoded in [the PLF
+format](http://www.statmt.org/moses/?n=Moses.WordLattices). Lattice decoding was originally added
+by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
+
+Joshua will automatically detect whether the input sentence is a regular sentence
+(the usual case) or a lattice. If it is a lattice, a feature will be activated that accumulates the
+cost of different paths through the lattice. In this case, you need to ensure that a weight for this
+feature is present in [your model file](decoder.html).
+
+The main caveat with Joshua's PLF lattice support is that the lattice needs to be listed on a
+single line.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/packing.md
----------------------------------------------------------------------
diff --git a/4.0/packing.md b/4.0/packing.md
new file mode 100644
index 0000000..9318f6e
--- /dev/null
+++ b/4.0/packing.md
@@ -0,0 +1,76 @@
+---
+layout: default4
+category: advanced
+title: Grammar Packing
+---
+
+Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
+efficiently encoding it for use by Joshua. Packing the grammar results in significantly faster load
+times for very large grammars.
+
+Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
+automatically, and we will provide a script that automates these steps for you.
+
+1. Make sure the grammar is labeled. A labeled grammar is one that has feature names attached to
+each of the feature values in each row of the grammar file. Here is a line from an unlabeled
+grammar:
+
+        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184
+
+   and here is one from a labeled grammar (note that the labels are not very useful):
+
+        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184
+
+   If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:
+
+        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py > grammar-labeled.gz
+
+   A side effect of this step is to produce a file 'dense_map' in the current directory,
+   containing the mapping between feature names and feature columns. This file is needed in later
+   steps.
+
+1. The packer needs a sorted grammar. It is sufficient to sort by the first word:
+
+        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz
+
+   (The reason we need a sorted grammar is that the packer stores the grammar in a trie. The
+   pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
+   grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
+   lookup).
+
+1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
+   and (2) a dense map file.
+ + 1. Write a packer config file. This file specifies items such as the chunk size (for the packed + pieces) and the quantization classes and types for each feature name. Examples can be found + at + + $JOSHUA/test/packed/packer.config + $JOSHUA/test/bn-en/packed/packer.quantized + $JOSHUA/test/bn-en/packed/packer.uncompressed + + The quantizer lines in the packer config file have the following format: + + quantizer TYPE FEATURES + + where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a + space-delimited list of feature names that have that quantization type. + + 1. Write a dense_map file. If you labeled an unlabeled grammar, this was produced for you as a + side product of the `label_grammar.py` script you called in Step 1. Otherwise, you need to + create a file that lists the mapping between feature names and (0-indexed) columns in the + grammar, one per line, in the following format: + + feature-index feature-name + +1. To pack the grammar, type the following command: + + java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE + + This will read in your packer configuration file and your grammar, and produced a packed grammar + in the output directory. + +1. To use the packed grammar, just point to the packed directory in your Joshua configuration file. + + tm-file = packed-grammar/ + tm-format = packed http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/pipeline.md ---------------------------------------------------------------------- diff --git a/4.0/pipeline.md b/4.0/pipeline.md new file mode 100644 index 0000000..33eafb3 --- /dev/null +++ b/4.0/pipeline.md @@ -0,0 +1,576 @@ +--- +layout: default4 +category: links +title: The Joshua Pipeline +--- + +This page describes the Joshua pipeline script, which manages the complexity of training and +evaluating machine translation systems. The pipeline eases the pain of two related tasks in +statistical machine translation (SMT) research: + +1. Training SMT systems involves a complicated process of interacting steps that are time-consuming +and prone to failure. + +1. Developing and testing new techniques requires varying parameters at different points in the +pipeline. Earlier results (which are often expensive) need not be recomputed. + +To facilitate these tasks, the pipeline script: +- Runs the complete SMT pipeline, from corpus normalization and tokenization, through model + building, tuning, test-set decoding, and evaluation. + +- Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the + pipeline can be debugged or shared across similar runs with (almost) no time spent recomputing + expensive steps. + +- Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment + stage), so long as you provide the missing dependencies. + +The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of +its features. It is not as extensive, however, as Moses' +[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS). + +## Installation + +The pipeline has no *required* external dependencies. However, it has support for a number of +external packages, some of which are included with Joshua. + +- [GIZA++](http://code.google.com/p/giza-pp/) + + GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when + you typed `ant all` from the Joshua root directory. 
It is not required because you can use the + (included) Berkeley aligner (`--aligner berkeley`). + +- [SRILM](http://www.speech.sri.com/projects/srilm/) + + By default, the pipeline uses a Java program from the + [Berkeley LM](http://code.google.com/p/berkeleylm/) package that constructs an + Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If + you wish to use SRILM instead, you need to do the following: + + 1. Install SRILM and set the `$SRILM` environment variable to point to its installed location. + 1. Add the `--lm-gen srilm` flag to your pipeline invocation. + + More information on this is available in the [LM building section of the pipeline](#lm). SRILM + is not used for representing language models during decoding (and in fact is not supported, + having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) and BerkeleyLM). + +- [Hadoop](http://hadoop.apache.org/) + + The pipeline uses the [Thrax grammar extractor](thrax.html), which is built on Hadoop. If you + have a Hadoop installation, simply ensure that the `$HADOOP` environment variable is defined, and + the pipeline will use it automatically at the grammar extraction step. If you are going to + attempt to extract very large grammars, it is best to have a good-sized Hadoop installation. + + (If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be + installed in a + ["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed) + mode that allows it to use just a few machines or a number of processors on a single machine. + The main issue is to ensure that there are a lot of independent physical disks, since in our + experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on + the disks.) + + If you don't have a Hadoop installation, there are still no worries. The pipeline will unroll a + standalone installation and use it to extract your grammar. This behavior will be triggered if + `$HADOOP` is undefined. + +Make sure that the environment variable `$JOSHUA` is defined, and you should be all set. + +## A basic pipeline run + +The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of +intermediate files in the *run directory*. By default, the run directory is the current directory, +but it can be changed with the `--rundir` parameter. + +For this quick start, we will be working with the example that can be found in +`$JOSHUA/examples/pipeline`. This example contains 1,000 sentences of Urdu-English data (the full +dataset is available as part of the +[Indian languages parallel corpora](http://joshua-decoder.org/indian-parallel-corpora/) with +100-sentence tuning and test sets with four references each. + +Running the pipeline requires two main steps: data preparation and invocation. + +1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning, + and test data. A good convention is to place these files in an input/ subdirectory of your run's + working directory (NOTE: do not use `data/`, since a directory of that name is created and used + by the pipeline itself). 
The expected format (for each of training, tuning, and test) is a pair + of files that share a common path prefix and are distinguished by their extension: + + input/ + train.SOURCE + train.TARGET + tune.SOURCE + tune.TARGET + test.SOURCE + test.TARGET + + These files should be parallel at the sentence level (with one sentence per line), should be in + UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote + variables that should be replaced with the actual target and source language abbreviations (e.g., + "ur" and "en"). + +1. Run the pipeline. The following is the minimal invocation to run the complete pipeline: + + $JOSHUA/scripts/training/pipeline.pl \ + --corpus input/train \ + --tune input/tune \ + --test input/devtest \ + --source SOURCE \ + --target TARGET + + The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatened with the + language extensions given by `--target` and `--source` (with a "." in betwee). Note the + correspondences with the files defined in the first step above. The prefixes can be either + absolute or relative pathnames. This particular invocation assumes that a subdirectory `input/` + exists in the current directory, that you are translating from a language identified "ur" + extension to a language identified by the "en" extension, that the training data can be found at + `input/train.en` and `input/train.ur`, and so on. + +Assuming no problems arise, this command will run the complete pipeline in about 20 minutes, +producing BLEU scores at the end. As it runs, you will see output that looks like the following: + + [train-copy-en] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.en + dep=data/train/train.en.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz + took 0 seconds (0s) + [train-copy-ur] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.ur + dep=data/train/train.ur.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz + took 0 seconds (0s) + ... + +And in the current directory, you will see the following files (among other intermediate files +generated by the individual sub-steps). + + data/ + train/ + corpus.ur + corpus.en + thrax-input-file + tune/ + tune.tok.lc.ur + tune.tok.lc.en + grammar.filtered.gz + grammar.glue + test/ + test.tok.lc.ur + test.tok.lc.en + grammar.filtered.gz + grammar.glue + alignments/ + 0/ + [berkeley aligner output files] + training.align + thrax-hiero.conf + thrax.log + grammar.gz + lm.gz + tune/ + 1/ + decoder_command + joshua.config + params.txt + joshua.log + mert.log + joshua.config.ZMERT.final + final-bleu + +These files will be described in more detail in subsequent sections of this tutorial. + +Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before +running the pipeline. By default the rundir is the current directory. Changing it can be useful +for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus` +or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself +(unless they happen to be the same, of course). + +The complete pipeline comprises many tens of small steps, which can be grouped together into a set +of traditional pipeline tasks: + +1. [Data preparation](#prep) +1. [Alignment](#alignment) +1. [Parsing](#parsing) +1. [Grammar extraction](#tm) +1. 
[Language model building](#lm) +1. [Tuning](#tuning) +1. [Testing](#testing) + +These steps are discussed below, after a few intervening sections about high-level details of the +pipeline. + +## Grammar options + +Joshua can extract two types of grammars: Hiero-style grammars and SAMT grammars. As described on +the [file formats page](file-formats.html), both of them are encoded into the same file format, but +they differ in terms of the richness of their nonterminal sets. + +Hiero grammars make use of a single nonterminals, and are extracted by computing phrases from +word-based alignments and then subtracting out phrase differences. More detail can be found in +[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201). +[SAMT grammars](http://www.cs.cmu.edu/~zollmann/samt/) make use of a source- or target-side parse +tree on the training data, projecting constituent labels down on the phrasal alignments in a variety +of configurations. SAMT grammars are usually many times larger and are much slower to decode with, +but sometimes increase BLEU score. Both grammar formats are extracted with the +[Thrax software](thrax.html). + +By default, the Joshua pipeline extract a Hiero grammar, but this can be altered with the `--type +samt` flag. + +## Other high-level options + +The following command-line arguments control run-time behavior of multiple steps: + +- `--threads N` (1) + + This enables multithreaded operation for a number of steps: alignment (with GIZA, max two + threads), parsing, and decoding (any number of threads) + +- `--jobs N` (1) + + This enables parallel operation over a cluster using the qsub command. This feature is not + well-documented at this point, but you will likely want to edit the file + `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to setup your qsub environment, and may also + want to pass specific qsub commands via the `--qsub-args "ARGS"` command. + +## Restarting failed runs + +If the pipeline dies, you can restart it with the same command you used the first time. If you +rerun the pipeline with the exact same invocation as the previous run (or an overlapping +configuration -- one that causes the same set of behaviors), you will see slightly different +output compared to what we saw above: + + [train-copy-en] cached, skipping... + [train-copy-ur] cached, skipping... + ... + +This indicates that the caching module has discovered that the step was already computed and thus +did not need to be rerun. This feature is quite useful for restarting pipeline runs that have +crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that +plague MT researchers across the world. + +Often, a command will die because it was parameterized incorrectly. For example, perhaps the +decoder ran out of memory. This allows you to adjust the parameter (e.g., `--joshua-mem`) and rerun +the script. Of course, if you change one of the parameters a step depends on, it will trigger a +rerun, which in turn might trigger further downstream reruns. + +## Skipping steps, quitting early + +You will also find it useful to start the pipeline somewhere other than data preparation (for +example, if you have already-processed data and an alignment, and want to begin with building a +grammar) or to end it prematurely (if, say, you don't have a test set and just want to tune a +model). 
This can be accomplished with the `--first-step` and `--last-step` flags, which take as +argument a case-insensitive version of the following steps: + +- *FIRST*: Data preparation. Everything begins with data preparation. This is the default first + step, so there is no need to be explicit about it. + +- *ALIGN*: Alignment. You might want to start here if you want to skip data preprocessing. + +- *PARSE*: Parsing. This is only relevant for building SAMT grammars (`--type samt`), in which case + the target side (`--target`) of the training data (`--corpus`) is parsed before building a + grammar. + +- *THRAX*: Grammar extraction [with Thrax](thrax.html). If you jump to this step, you'll need to + provide an aligned corpus (`--alignment`) along with your parallel data. + +- *TUNE*: Tuning. The exact tuning method is determined with `--tuner {mert,pro}`. With this + option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test + (`--test-grammar`) grammars. A full grammar (`--grammar`) will be filtered against the relevant + tuning or test set unless you specify `--no-filter-tm`. If you want a language model built from + the target side of your training data, you'll also need to pass in the training corpus + (`--corpus`). You can also specify an arbitrary number of additional language models with one or + more `--lmfile` flags. + +- *TEST*: Testing. If you have a tuned model file, you can test new corpora by passing in a test + corpus with references (`--test`). You'll need to provide a run name (`--name`) to store the + results of this run, which will be placed under `test/NAME`. You'll also need to provide a + Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a + grammar (`--grammar`); this will be filtered to the test data unless you specify + `--no-filter-tm`) or unless you directly provide a filtered test grammar (`--test-grammar`). + +- *LAST*: The last step. This is the default target of `--last-step`. + +We now discuss these steps in more detail. + +<a name="prep" /> +## 1. DATA PREPARATION + +Data prepare involves doing the following to each of the training data (`--corpus`), tuning data +(`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path +prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and +TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET +files must have the same number of lines. + +For tuning and test data, multiple references are handled automatically. A single reference will +have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where +NUM starts at 0 and increments for as many references as there are. + +The following processing steps are applied to each file. + +1. **Copying** the files into `RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test". + Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune` + and `--test` flags are not currently allowed. + +1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special + quotations). There are a few language-specific options that depend on the file extension + matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) + designation. + +1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). 
We now discuss these steps in more detail.

<a name="prep" />
## 1. DATA PREPARATION

Data preparation involves doing the following to each of the training data (`--corpus`), tuning
data (`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path
prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and
TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET
files must have the same number of lines.

For tuning and test data, multiple references are handled automatically. A single reference will
have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where
NUM starts at 0 and increments for as many references as there are.

The following processing steps are applied to each file.

1. **Copying** the files into `RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
   Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune`
   and `--test` flags are not currently allowed.

1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special
   quotations). There are a few language-specific options that depend on the file extension
   matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
   designation.

1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). Again, there
   are language-specific tokenizers for a few languages (English, German, and Greek).

1. (Training only) **Removing** all parallel sentences with more than `--maxlen` tokens on either
   side. By default, MAXLEN is 50. To turn this off, specify `--maxlen 0`.

1. **Lowercasing**.

This creates a series of intermediate files which are saved for posterity but compressed. For
example, you might see

    data/
        train/
            train.en.gz
            train.tok.en.gz
            train.tok.50.en.gz
            train.tok.50.lc.en
            corpus.en -> train.tok.50.lc.en

The file "corpus.LANG" is a symbolic link to the last file in the chain.

<a name="alignment" />
## 2. ALIGNMENT

Alignments are computed between the parallel corpora at
`RUNDIR/data/train/corpus.{SOURCE,TARGET}`. To prevent the alignment tables from getting too big,
the parallel corpora are grouped into files of no more than ALIGNER\_CHUNK\_SIZE blocks (controlled
with a parameter below). The last block is folded into the penultimate block if it is too small.
These chunked files are all created in a subdirectory of `RUNDIR/data/train/splits`, named
`corpus.LANG.0`, `corpus.LANG.1`, and so on.

The pipeline parameters affecting alignment are:

- `--aligner ALIGNER` {giza (default), berkeley}

  Which aligner to use. The default is [GIZA++](http://code.google.com/p/giza-pp/), but
  [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead. When
  using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it
  with `--aligner-mem` (the default is 10g).

- `--aligner-chunk-size SIZE` (1,000,000)

  The number of sentence pairs to compute alignments over.

- `--alignment FILE`

  If you have an already-computed alignment, you can pass that to the script using this flag.
  Note that, in this case, you will want to skip data preparation and alignment using
  `--first-step thrax` (the first step after alignment) and also to specify `--no-prepare-data` so
  as not to retokenize the data and invalidate your alignments (see the sketch at the end of this
  section).

  The alignment file format is the standard format, where the 0-indexed many-many alignment pairs
  for a sentence are provided on a single line, source language first, e.g.,

      0-0 0-1 1-2 1-7 ...

  This value is required if you start at the grammar extraction step.

When alignment is complete, the alignment file can be found at `RUNDIR/alignments/training.align`.
It is parallel to the training corpora. There are many files in the `alignments/` subdirectory
that contain the output of intermediate steps.
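For example, a sketch of a run that reuses a previously computed alignment (the alignment path and
corpus prefixes are illustrative) looks like this:

    # alignment path and corpus prefixes are illustrative
    pipeline.pl --corpus input/train --tune input/tune --test input/devtest \
      --source ur --target en \
      --alignment alignments/training.align \
      --first-step thrax --no-prepare-data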
<a name="parsing" />
## 3. PARSING

When SAMT grammars are being built (`--type samt`), the target side of the training data must be
parsed. The pipeline assumes your target side will be English, and will parse it for you using
[the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included. If your
target-side language is not English, the target side of your training data (found at
CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that it is parsed
and will not reparse it.

Parsing is affected by both the `--threads N` and `--jobs N` options. The former runs the parser
in multithreaded mode, while the latter distributes the runs across a cluster (and requires some
configuration, not yet documented). The options are mutually exclusive.

Once the parsing is complete, there will be two parsed files:

- `RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed.
- `RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used
  for grammar extraction.

<a name="tm" />
## 4. THRAX (grammar extraction)

The grammar extraction step takes three pieces of data: (1) the source-language training corpus,
(2) the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3)
the alignment file. From these, it computes a synchronous context-free grammar. If you already
have a grammar and wish to skip this step, you can do so by passing the grammar with the
`--grammar GRAMMAR` flag.

The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply
ensure that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it. If
you *do not* have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
standalone mode (this mode is triggered when `$HADOOP` is undefined). Theoretically, any grammar
extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
enough; in practice, you probably are not patient enough, and will be limited to smaller datasets.
Setting up your own Hadoop cluster is not too difficult a chore; in particular, you may find it
helpful to install a
[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
In our experience, this works fine, but you should note the following caveats:

- It is of crucial importance that you have enough physical disks. We have found that having too
  few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues
  that are hard to resolve, such as timeouts.
- NFS filesystems can exacerbate this. You should really try to install physical disks that are
  dedicated to Hadoop scratch space.

Here are some flags relevant to Hadoop and grammar extraction with Thrax (see the sketch below for
an example):

- `--hadoop /path/to/hadoop`

  This sets the location of Hadoop (overriding the environment variable `$HADOOP`).

- `--hadoop-mem MEM` (2g)

  This alters the amount of memory available to Hadoop mappers (passed via the
  `mapred.child.java.opts` option).

- `--thrax-conf FILE`

  Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax
  templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is
  one of "hiero" or "samt".

When the grammar is extracted, it is compressed and placed at `RUNDIR/grammar.gz`.
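As an illustrative sketch (the Hadoop location is an assumption about your own installation, and
the corpus prefixes are again placeholders), a run that uses an existing Hadoop installation and
gives the mappers extra memory might be invoked as:

    # /opt/hadoop is a hypothetical install location; corpus prefixes are illustrative
    pipeline.pl --corpus input/train --tune input/tune --test input/devtest \
      --source ur --target en \
      --hadoop /opt/hadoop --hadoop-mem 4g

Leaving off `--hadoop` (and leaving `$HADOOP` unset) falls back to standalone mode, as described
above.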
<a name="lm" />
## 5. LANGUAGE MODEL

Before tuning can take place, a language model is needed. A language model is always built from
the target side of the training corpus unless `--no-corpus-lm` is specified. In addition, you can
provide other language models (any number of them) with the `--lmfile FILE` argument. Other
arguments are as follows.

- `--lm` {kenlm (default), berkeleylm}

  This determines the language model code that will be used when decoding. These implementations
  are described in their respective papers (PDFs:
  [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
  [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)).

- `--lmfile FILE`

  Specifies a pre-built language model to use when decoding. This language model can be in ARPA
  format, or in KenLM format (when using KenLM) or BerkeleyLM format (when using BerkeleyLM).

- `--lm-gen` {berkeleylm (default), srilm}, `--buildlm-mem MEM`, `--witten-bell`

  At the tuning step, an LM is built from the target side of the training data (unless
  `--no-corpus-lm` is specified). This option controls which code is used to build it. The default
  is a
  [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
  that computes a Kneser-Ney LM with constant discounting and no count thresholding. The flag
  `--buildlm-mem` can be used to control how much memory is allocated to the Java process. The
  default is "2g", but you will want to increase it for larger language models.

  If SRILM is used, it is called with the following arguments:

      $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz

  where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline.

A language model built from the target side of the training data is placed at `RUNDIR/lm.gz`.


## Interlude: decoder arguments

Running the decoder is done in both the tuning stage and the testing stage. A critical point is
that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in
particular when decoding with large grammars and large language models. The default amount of
memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar).
You can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is
a Java memory specification (passed to its `-Xmx` flag).

<a name="tuning" />
## 6. TUNING

Two optimizers are implemented for Joshua: MERT and PRO (`--tuner {mert,pro}`). Tuning is run
until convergence in the `RUNDIR/tune` directory. By default, tuning is run just once, but the
pipeline supports running the optimizer an arbitrary number of times, due to
[recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the variance of tuning
procedures in machine translation, in particular MERT. This can be activated with
`--optimizer-runs N`. Each run can be found in a directory `RUNDIR/tune/N`.

When tuning is finished, each final configuration file can be found at either

    RUNDIR/tune/N/joshua.config.ZMERT.final
    RUNDIR/tune/N/joshua.config.PRO.final

where N varies from 1..`--optimizer-runs`.

<a name="testing" />
## 7. TESTING

For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.
Afterwards, by default, minimum Bayes-risk decoding is run on the 300-best output. This step
usually yields about 0.3 - 0.5 BLEU points, but is time-consuming, and can be turned off with the
`--no-mbr` flag.

After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
writes it to `RUNDIR/test/final-bleu`, and prints it to the screen. That's the end of the
pipeline!

Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a
number of arguments (a sketch follows this list):

- `--first-step TEST`

  This tells the decoder to start at the test step.

- `--name NAME`

  A name is needed to distinguish this test set from the previous ones. Output for this test run
  will be stored at `RUNDIR/test/NAME`.

- `--joshua-config CONFIG`

  A tuned parameter file is required. This file will be the output of some prior tuning run.
  Necessary pathnames and so on will be adjusted.
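For instance, a sketch of decoding an additional test set against the first tuning run (the run
name, test prefix, and file locations are illustrative; paths are relative to the run directory)
might be:

    # run name, test prefix, and file locations are illustrative
    pipeline.pl --first-step TEST --name newstest \
      --test input/newstest \
      --source ur --target en \
      --joshua-config tune/1/joshua.config.ZMERT.final \
      --grammar grammar.gz --lmfile lm.gz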
## COMMON USE CASES AND PITFALLS

- If the pipeline dies at the "thrax-run" stage with an error like the following:

      JOB FAILED (return code 1)
      hadoop/bin/hadoop: line 47:
      /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory
      Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell

  This occurs if the `$HADOOP` environment variable is set but does not point to a working Hadoop
  installation. To fix it, make sure to unset the variable:

      # in bash
      unset HADOOP

  and then rerun the pipeline with the same invocation.

- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In
  particular, SAMT grammars often require a large amount of memory. Many steps have been taken to
  reduce memory usage, including beam settings and test-set- and sentence-level filtering of
  grammars. However, memory usage can still be in the tens of gigabytes.

  To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
  amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
  nodes obtained by the qsub command. These are set with the `--joshua-mem MEM` and
  `--qsub-args ARGS` options. For example,

      pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...

  Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2
  GB from the Hadoop server. If more memory is needed, set the memory requirement with
  `--hadoop-mem`, in the same way the `--joshua-mem` option is used.

- Other pitfalls and advice will be added as they are discovered.

## FEEDBACK

Please email [email protected] with problems or suggestions.
