http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/large-lms.md
----------------------------------------------------------------------
diff --git a/5.0/large-lms.md b/5.0/large-lms.md
new file mode 100644
index 0000000..28ba0b9
--- /dev/null
+++ b/5.0/large-lms.md
@@ -0,0 +1,192 @@
---
layout: default
title: Building large LMs with SRILM
category: advanced
---

The following is a tutorial for building a large language model from the
English Gigaword Fifth Edition corpus
[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
using SRILM. English text is provided from seven different sources.

### Step 0: Clean up the corpus

The Gigaword corpus has to be stripped of all SGML tags and tokenized.
Instructions for performing those steps are not included in this
documentation. A description of this process can be found in a paper called
["Annotated Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).

The Joshua package ships with a script that converts all alphabetical
characters to their lowercase equivalent. The script is located at
`$JOSHUA/scripts/lowercase.perl`.

Make a directory structure as follows:

    gigaword/
    ├── corpus/
    │   ├── afp_eng/
    │   │   ├── afp_eng_199405.lc.gz
    │   │   ├── afp_eng_199406.lc.gz
    │   │   ├── ...
    │   │   └── counts/
    │   ├── apw_eng/
    │   │   ├── apw_eng_199411.lc.gz
    │   │   ├── apw_eng_199412.lc.gz
    │   │   ├── ...
    │   │   └── counts/
    │   ├── cna_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── ltw_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── nyt_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── wpb_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   └── xin_eng/
    │       ├── ...
    │       └── counts/
    └── lm/
        ├── afp_eng/
        ├── apw_eng/
        ├── cna_eng/
        ├── ltw_eng/
        ├── nyt_eng/
        ├── wpb_eng/
        └── xin_eng/

The next step will be to build smaller LMs and then interpolate them into one
file.

### Step 1: Count ngrams

Run the following script once from each source directory under the `corpus/`
directory (edit it to specify the path to the `ngram-count` binary as well as
the number of processors):

    #!/bin/sh

    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
    args=""

    for source in *.gz; do
       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
    done

    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT

Then move each `counts/` directory to the corresponding directory under
`lm/`. Now that each ngram has been counted, we can make a language
model for each of the seven sources.

### Step 2: Make individual language models

SRILM includes a script, called `make-big-lm`, for building large language
models under resource-limited environments. The manual for this script can be
read online
[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
even in environments with many parallel processors and a lot of memory.
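Before running `make-big-lm`, make sure the per-source `counts/` directories produced in Step 1
have been moved from `corpus/` to the corresponding directories under `lm/`. A minimal sketch of
that move (assuming the directory layout shown above; adjust paths to your setup):

    # run from the top-level gigaword/ directory
    for d in afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng; do
        mv corpus/$d/counts lm/$d/counts
    done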
Initiate the following script from each of the source directories under the
`lm/` directory (edit it to specify the path to the `make-big-lm` script as
well as the pruning threshold):

    #!/bin/bash
    set -x

    CMD=$SRILM_SRC/bin/make-big-lm
    PRUNE_THRESHOLD=1e-8

    $CMD \
      -name gigalm `for k in counts/*.gz; do echo " \
      -read $k "; done` \
      -lm lm.gz \
      -max-per-file 100000000 \
      -order 5 \
      -kndiscount \
      -interpolate \
      -unk \
      -prune $PRUNE_THRESHOLD

The language model attributes chosen are the following:

* N-grams up to order 5
* Kneser-Ney smoothing
* N-gram probability estimates at the specified order *n* are interpolated with
  lower-order estimates
* the unknown-word token is included as a regular word
* N-grams are pruned according to the specified threshold

Next, we will mix the models together into a single file.

### Step 3: Mix models together

Using development text, interpolation weights can be determined that give the
highest weight to the source language models with the lowest perplexity on the
specified development set.

#### Step 3-1: Determine interpolation weights

Initiate the following script from the `lm/` directory (edit it to specify the
path to the `ngram` binary as well as the path to the development text file):

    #!/bin/bash
    set -x

    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es

    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )

    for d in ${dirs[@]} ; do
      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
    done

    compute-best-mix */lm.ppl > best-mix.ppl

Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
values in parentheses. These are the interpolation weights of the source
language models, in the order specified. Copy and paste the values within the
parentheses into the script below.

#### Step 3-2: Combine the models

Initiate the following script from the `lm/` directory (edit it to specify the
path to the `ngram` binary as well as the interpolation weights):

    #!/bin/bash
    set -x

    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
    DIRS=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)

    $NGRAM -order 5 -unk \
      -lm ${DIRS[0]}/lm.gz -lambda ${LAMBDAS[0]} \
      -mix-lm ${DIRS[1]}/lm.gz \
      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
      -write-lm mixed_lm.gz

The resulting file, `mixed_lm.gz`, is a language model based on all the text in
the Gigaword corpus, with some probabilities biased toward the development text
specified in Step 3-1. It is in the ARPA format. The optional next step converts
it into the KenLM format.

#### Step 3-3: Convert to KenLM

The KenLM format has some speed advantages over the ARPA format. Issuing the
following command writes a new language model file, `mixed_lm.kenlm`, which is
the `mixed_lm.gz` language model transformed into the KenLM format.

    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
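As an optional sanity check, either before or after the conversion, you can compute the mixed
model's perplexity on the development text used in Step 3-1; it should generally be no higher than
that of the best individual source model. A sketch reusing the `$NGRAM` and `$DEV_TEXT` variables
from the scripts above:

    $NGRAM -order 5 -unk -lm mixed_lm.gz -ppl $DEV_TEXT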
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/packing.md
----------------------------------------------------------------------
diff --git a/5.0/packing.md b/5.0/packing.md
new file mode 100644
index 0000000..2f39ba7
--- /dev/null
+++ b/5.0/packing.md
@@ -0,0 +1,76 @@
---
layout: default
category: advanced
title: Grammar Packing
---

Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
efficiently encoding it for use by Joshua. Packing the grammar results in significantly faster load
times for very large grammars.

Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
automatically, and we will provide a script that automates these steps for you.

1. Make sure the grammar is labeled. A labeled grammar is one that has feature names attached to
each of the feature values in each row of the grammar file. Here is a line from an unlabeled
grammar:

        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184

    and here is one from a labeled grammar (note that the labels are not very useful):

        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184

    If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:

        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py > grammar-labeled.gz

    A side effect of this step is that it produces a file `dense_map` in the current directory,
    containing the mapping between feature names and feature columns. This file is needed in later
    steps.

1. The packer needs a sorted grammar. It is sufficient to sort by the first word:

        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz

    (The reason we need a sorted grammar is that the packer stores the grammar in a trie. The
    pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
    grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
    lookup.)

1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
   and (2) a dense map file.

    1. Write a packer config file. This file specifies items such as the chunk size (for the packed
       pieces) and the quantization classes and types for each feature name. Examples can be found
       at

           $JOSHUA/test/packed/packer.config
           $JOSHUA/test/bn-en/packed/packer.quantized
           $JOSHUA/test/bn-en/packed/packer.uncompressed

       The quantizer lines in the packer config file have the following format:

           quantizer TYPE FEATURES

       where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a
       space-delimited list of feature names that have that quantization type.

    1. Write a dense_map file. If you labeled an unlabeled grammar, this was produced for you as a
       side product of the `label_grammar.py` script you called in Step 1. Otherwise, you need to
       create a file that lists the mapping between feature names and (0-indexed) columns in the
       grammar, one per line, in the following format:

           feature-index feature-name

1. To pack the grammar, type the following command:

        java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE

    This will read in your packer configuration file and your grammar, and produce a packed grammar
    in the output directory.

1.
To use the packed grammar, just point to the packed directory in your Joshua configuration file. + + tm-file = packed-grammar/ + tm-format = packed http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/pipeline.md ---------------------------------------------------------------------- diff --git a/5.0/pipeline.md b/5.0/pipeline.md new file mode 100644 index 0000000..fbe052d --- /dev/null +++ b/5.0/pipeline.md @@ -0,0 +1,640 @@ +--- +layout: default +category: links +title: The Joshua Pipeline +--- + +This page describes the Joshua pipeline script, which manages the complexity of training and +evaluating machine translation systems. The pipeline eases the pain of two related tasks in +statistical machine translation (SMT) research: + +- Training SMT systems involves a complicated process of interacting steps that are + time-consuming and prone to failure. + +- Developing and testing new techniques requires varying parameters at different points in the + pipeline. Earlier results (which are often expensive) need not be recomputed. + +To facilitate these tasks, the pipeline script: + +- Runs the complete SMT pipeline, from corpus normalization and tokenization, through alignment, + model building, tuning, test-set decoding, and evaluation. + +- Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the + pipeline can be debugged or shared across similar runs while doing away with time spent + recomputing expensive steps. + +- Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment + stage), so long as you provide the missing dependencies. + +The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of +its features. It is not as extensive, however, as Moses' +[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows +the user to define arbitrary execution dependency graphs. + +## Installation + +The pipeline has no *required* external dependencies. However, it has support for a number of +external packages, some of which are included with Joshua. + +- [GIZA++](http://code.google.com/p/giza-pp/) (included) + + GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when + you typed `ant` from the Joshua root directory. It is not required because you can use the + (included) Berkeley aligner (`--aligner berkeley`). We have recently also provided support + for the [Jacana-XY aligner](http://code.google.com/p/jacana-xy/wiki/JacanaXY) (`--aligner + jacana`). + +- [Hadoop](http://hadoop.apache.org/) (included) + + The pipeline uses the [Thrax grammar extractor](thrax.html), which is built on Hadoop. If you + have a Hadoop installation, simply ensure that the `$HADOOP` environment variable is defined, and + the pipeline will use it automatically at the grammar extraction step. If you are going to + attempt to extract very large grammars, it is best to have a good-sized Hadoop installation. + + (If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be + installed in a + ["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed) + mode that allows it to use just a few machines or a number of processors on a single machine. 
+ The main issue is to ensure that there are a lot of independent physical disks, since in our + experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on + the disks.) + + If you don't have a Hadoop installation, there are still no worries. The pipeline will unroll a + standalone installation and use it to extract your grammar. This behavior will be triggered if + `$HADOOP` is undefined. + +- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included) + + By default, the pipeline uses a Java program from the + [Berkeley LM](http://code.google.com/p/berkeleylm/) package that constructs an + Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If + you wish to use SRILM instead, you need to do the following: + + 1. Install SRILM and set the `$SRILM` environment variable to point to its installed location. + 1. Add the `--lm-gen srilm` flag to your pipeline invocation. + + More information on this is available in the [LM building section of the pipeline](#lm). SRILM + is not used for representing language models during decoding (and in fact is not supported, + having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and + BerkeleyLM). + +- [Moses](http://statmt.org/moses/) (not included) + +Make sure that the environment variable `$JOSHUA` is defined, and you should be all set. + +## A basic pipeline run + +The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of +intermediate files in the *run directory*. By default, the run directory is the current directory, +but it can be changed with the `--rundir` parameter. + +For this quick start, we will be working with the example that can be found in +`$JOSHUA/examples/pipeline`. This example contains 1,000 sentences of Urdu-English data (the full +dataset is available as part of the +[Indian languages parallel corpora](/indian-parallel-corpora/) with +100-sentence tuning and test sets with four references each. + +Running the pipeline requires two main steps: data preparation and invocation. + +1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning, + and test data. A good convention is to place these files in an input/ subdirectory of your run's + working directory (NOTE: do not use `data/`, since a directory of that name is created and used + by the pipeline itself for storing processed files). The expected format (for each of training, + tuning, and test) is a pair of files that share a common path prefix and are distinguished by + their extension, e.g., + + input/ + train.SOURCE + train.TARGET + tune.SOURCE + tune.TARGET + test.SOURCE + test.TARGET + + These files should be parallel at the sentence level (with one sentence per line), should be in + UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote + variables that should be replaced with the actual target and source language abbreviations (e.g., + "ur" and "en"). + +1. Run the pipeline. The following is the minimal invocation to run the complete pipeline: + + $JOSHUA/bin/pipeline.pl \ + --corpus input/train \ + --tune input/tune \ + --test input/devtest \ + --source SOURCE \ + --target TARGET + + The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatened with the + language extensions given by `--target` and `--source` (with a "." in between). Note the + correspondences with the files defined in the first step above. 
The prefixes can be either + absolute or relative pathnames. This particular invocation assumes that a subdirectory `input/` + exists in the current directory, that you are translating from a language identified "ur" + extension to a language identified by the "en" extension, that the training data can be found at + `input/train.en` and `input/train.ur`, and so on. + +*Don't* run the pipeline directly from `$JOSHUA`. We recommend creating a run directory somewhere + else to contain all of your experiments in some other location. The advantage to this (apart from + not clobbering part of the Joshua install) is that Joshua provides support scripts for visualizing + the results of a series of experiments that only work if you + +Assuming no problems arise, this command will run the complete pipeline in about 20 minutes, +producing BLEU scores at the end. As it runs, you will see output that looks like the following: + + [train-copy-en] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.en + dep=data/train/train.en.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz + took 0 seconds (0s) + [train-copy-ur] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.ur + dep=data/train/train.ur.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz + took 0 seconds (0s) + ... + +And in the current directory, you will see the following files (among other intermediate files +generated by the individual sub-steps). + + data/ + train/ + corpus.ur + corpus.en + thrax-input-file + tune/ + tune.tok.lc.ur + tune.tok.lc.en + grammar.filtered.gz + grammar.glue + test/ + test.tok.lc.ur + test.tok.lc.en + grammar.filtered.gz + grammar.glue + alignments/ + 0/ + [giza/berkeley aligner output files] + training.align + thrax-hiero.conf + thrax.log + grammar.gz + lm.gz + tune/ + 1/ + decoder_command + joshua.config + params.txt + joshua.log + mert.log + joshua.config.ZMERT.final + final-bleu + +These files will be described in more detail in subsequent sections of this tutorial. + +Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before +running the pipeline. By default the rundir is the current directory. Changing it can be useful +for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus` +or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself +(unless they happen to be the same, of course). + +The complete pipeline comprises many tens of small steps, which can be grouped together into a set +of traditional pipeline tasks: + +1. [Data preparation](#prep) +1. [Alignment](#alignment) +1. [Parsing](#parsing) (syntax-based grammars only) +1. [Grammar extraction](#tm) +1. [Language model building](#lm) +1. [Tuning](#tuning) +1. [Testing](#testing) +1. [Analysis](#analysis) + +These steps are discussed below, after a few intervening sections about high-level details of the +pipeline. + +## Managing groups of experiments + +The real utility of the pipeline comes when you use it to manage groups of experiments. Typically, +there is a held-out test set, and we want to vary a number of training parameters to determine what +effect this has on BLEU scores or some other metric. Joshua comes with a script +`$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports +them to you. 
This script works so long as you organize your runs as follows: + +1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`. + +2. For comparison purposes, the runs should all be evaluated on the same test set. + +3. Each run in the run group should be in its own numbered directory, shown with the files used by +the summarize script: + + $RUNDIR/ + 1/ + README.txt + test/ + final-bleu + final-times + [other files] + 2/ + README.txt + ... + +You can get such directories using the `--rundir N` flag to the pipeline. + +Run directories can build off each other. For example, `1/` might contain a complete baseline +run. If you wanted to just change the tuner, you don't need to rerun the aligner and model builder, +so you can reuse the results by supplying the second run with the information it needs that was +computed in step 1: + + $JOSHUA/bin/pipeline.pl \ + --first-step tune \ + --grammar 1/grammar.gz \ + ... + +More details are below. + +## Grammar options + +Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described +on the [file formats page](file-formats.html), all of them are encoded into the same file format, +but they differ in terms of the richness of their nonterminal sets. + +Hiero grammars make use of a single nonterminals, and are extracted by computing phrases from +word-based alignments and then subtracting out phrase differences. More detail can be found in +[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201). +[GHKM](http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf) (new with 5.0) and +[SAMT](http://www.cs.cmu.edu/~zollmann/samt/) grammars make use of a source- or target-side parse +tree on the training data, differing in the way they extract rules using these trees: GHKM extracts +synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas +SAMT projects constituent labels down onto phrases. SAMT grammars are usually many times larger and +are much slower to decode with, but sometimes increase BLEU score. Both grammar formats are +extracted with the [Thrax software](thrax.html). + +By default, the Joshua pipeline extract a Hiero grammar, but this can be altered with the `--type +(ghkm|samt)` flag. For GHKM grammars, the default is to use +[Michel Galley's extractor](http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz), +but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs +two features, so the scores tend to be significantly lower than that of Moses'. + +## Other high-level options + +The following command-line arguments control run-time behavior of multiple steps: + +- `--threads N` (1) + + This enables multithreaded operation for a number of steps: alignment (with GIZA, max two + threads), parsing, and decoding (any number of threads) + +- `--jobs N` (1) + + This enables parallel operation over a cluster using the qsub command. This feature is not + well-documented at this point, but you will likely want to edit the file + `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to setup your qsub environment, and may also + want to pass specific qsub commands via the `--qsub-args "ARGS"` command. + +## Restarting failed runs + +If the pipeline dies, you can restart it with the same command you used the first time. 
If you +rerun the pipeline with the exact same invocation as the previous run (or an overlapping +configuration -- one that causes the same set of behaviors), you will see slightly different +output compared to what we saw above: + + [train-copy-en] cached, skipping... + [train-copy-ur] cached, skipping... + ... + +This indicates that the caching module has discovered that the step was already computed and thus +did not need to be rerun. This feature is quite useful for restarting pipeline runs that have +crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that +plague MT researchers across the world. + +Often, a command will die because it was parameterized incorrectly. For example, perhaps the +decoder ran out of memory. This allows you to adjust the parameter (e.g., `--joshua-mem`) and rerun +the script. Of course, if you change one of the parameters a step depends on, it will trigger a +rerun, which in turn might trigger further downstream reruns. + +## <a id="steps" /> Skipping steps, quitting early + +You will also find it useful to start the pipeline somewhere other than data preparation (for +example, if you have already-processed data and an alignment, and want to begin with building a +grammar) or to end it prematurely (if, say, you don't have a test set and just want to tune a +model). This can be accomplished with the `--first-step` and `--last-step` flags, which take as +argument a case-insensitive version of the following steps: + +- *FIRST*: Data preparation. Everything begins with data preparation. This is the default first + step, so there is no need to be explicit about it. + +- *ALIGN*: Alignment. You might want to start here if you want to skip data preprocessing. + +- *PARSE*: Parsing. This is only relevant for building SAMT grammars (`--type samt`), in which case + the target side (`--target`) of the training data (`--corpus`) is parsed before building a + grammar. + +- *THRAX*: Grammar extraction [with Thrax](thrax.html). If you jump to this step, you'll need to + provide an aligned corpus (`--alignment`) along with your parallel data. + +- *TUNE*: Tuning. The exact tuning method is determined with `--tuner {mert,mira,pro}`. With this + option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test + (`--test-grammar`) grammars. A full grammar (`--grammar`) will be filtered against the relevant + tuning or test set unless you specify `--no-filter-tm`. If you want a language model built from + the target side of your training data, you'll also need to pass in the training corpus + (`--corpus`). You can also specify an arbitrary number of additional language models with one or + more `--lmfile` flags. + +- *TEST*: Testing. If you have a tuned model file, you can test new corpora by passing in a test + corpus with references (`--test`). You'll need to provide a run name (`--name`) to store the + results of this run, which will be placed under `test/NAME`. You'll also need to provide a + Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a + grammar (`--grammar`); this will be filtered to the test data unless you specify + `--no-filter-tm`) or unless you directly provide a filtered test grammar (`--test-grammar`). + +- *LAST*: The last step. This is the default target of `--last-step`. + +We now discuss these steps in more detail. + +### <a id="prep" /> 1. 
DATA PREPARATION + +Data prepare involves doing the following to each of the training data (`--corpus`), tuning data +(`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path +prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and +TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET +files must have the same number of lines. + +For tuning and test data, multiple references are handled automatically. A single reference will +have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where +NUM starts at 0 and increments for as many references as there are. + +The following processing steps are applied to each file. + +1. **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test". + Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune` + and `--test` flags are not currently allowed. + +1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special + quotations). There are a few language-specific options that depend on the file extension + matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) + designation. + +1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). Again, there + are language-specific tokenizations for a few languages (English, German, and Greek). + +1. (Training only) **Removing** all parallel sentences with more than `--maxlen` tokens on either + side. By default, MAXLEN is 50. To turn this off, specify `--maxlen 0`. + +1. **Lowercasing**. + +This creates a series of intermediate files which are saved for posterity but compressed. For +example, you might see + + data/ + train/ + train.en.gz + train.tok.en.gz + train.tok.50.en.gz + train.tok.50.lc.en + corpus.en -> train.tok.50.lc.en + +The file "corpus.LANG" is a symbolic link to the last file in the chain. + +## 2. ALIGNMENT <a id="alignment" /> + +Alignments are between the parallel corpora at `$RUNDIR/data/train/corpus.{SOURCE,TARGET}`. To +prevent the alignment tables from getting too big, the parallel corpora are grouped into files of no +more than ALIGNER\_CHUNK\_SIZE blocks (controlled with a parameter below). The last block is folded +into the penultimate block if it is too small. These chunked files are all created in a +subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on. + +The pipeline parameters affecting alignment are: + +- `--aligner ALIGNER` {giza (default), berkeley, jacana} + + Which aligner to use. The default is [GIZA++](http://code.google.com/p/giza-pp/), but + [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead. When + using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it + with `--aligner-mem` (the default is 10g). + +- `--aligner-chunk-size SIZE` (1,000,000) + + The number of sentence pairs to compute alignments over. The training data is split into blocks + of this size, aligned separately, and then concatenated. + +- `--alignment FILE` + + If you have an already-computed alignment, you can pass that to the script using this flag. 
+ Note that, in this case, you will want to skip data preparation and alignment using + `--first-step thrax` (the first step after alignment) and also to specify `--no-prepare` so + as not to retokenize the data and mess with your alignments. + + The alignment file format is the standard format where 0-indexed many-many alignment pairs for a + sentence are provided on a line, source language first, e.g., + + 0-0 0-1 1-2 1-7 ... + + This value is required if you start at the grammar extraction step. + +When alignment is complete, the alignment file can be found at `$RUNDIR/alignments/training.align`. +It is parallel to the training corpora. There are many files in the `alignments/` subdirectory that +contain the output of intermediate steps. + +### <a id="parsing" /> 3. PARSING + +To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target side of the +training data must be parsed. The pipeline assumes your target side will be English, and will parse +it for you using [the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included. +If it is not the case that English is your target-side language, the target side of your training +data (found at CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that +it is parsed and will not reparse it. + +Parsing is affected by both the `--threads N` and `--jobs N` options. The former runs the parser in +multithreaded mode, while the latter distributes the runs across as cluster (and requires some +configuration, not yet documented). The options are mutually exclusive. + +Once the parsing is complete, there will be two parsed files: + +- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed. +- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used for + grammar extraction. + +## 4. THRAX (grammar extraction) <a id="tm" /> + +The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2) +the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the +alignment file. From these, it computes a synchronous context-free grammar. If you already have a +grammar and wish to skip this step, you can do so passing the grammar with the `--grammar +/path/to/grammar` flag. + +The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply ensure +that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it. If you *do +not* have a Hadoop installation, the pipeline will roll out out for you, running Hadoop in +standalone mode (this mode is triggered when `$HADOOP` is undefined). Theoretically, any grammar +extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient +enough; in practice, you probably are not patient enough, and will be limited to smaller +datasets. You may also run into problems with disk space; Hadoop uses a lot (use `--tmp +/path/to/tmp` to specify an alternate place for temporary data; we suggest you use a local disk +partition with tens or hundreds of gigabytes free, and not an NFS partition). Setting up your own +Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a +[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html). +In our experience, this works fine, but you should note the following caveats: + +- It is of crucial importance that you have enough physical disks. 
We have found that having too + few, or too slow of disks, results in a whole host of seemingly unrelated issues that are hard to + resolve, such as timeouts. +- NFS filesystems can cause lots of problems. You should really try to install physical disks that + are dedicated to Hadoop scratch space. + +Here are some flags relevant to Hadoop and grammar extraction with Thrax: + +- `--hadoop /path/to/hadoop` + + This sets the location of Hadoop (overriding the environment variable `$HADOOP`) + +- `--hadoop-mem MEM` (2g) + + This alters the amount of memory available to Hadoop mappers (passed via the + `mapred.child.java.opts` options). + +- `--thrax-conf FILE` + + Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax + templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one + of "hiero" or "samt". + +When the grammar is extracted, it is compressed and placed at `$RUNDIR/grammar.gz`. + +## <a id="lm" /> 5. Language model + +Before tuning can take place, a language model is needed. A language model is always built from the +target side of the training corpus unless `--no-corpus-lm` is specified. In addition, you can +provide other language models (any number of them) with the `--lmfile FILE` argument. Other +arguments are as follows. + +- `--lm` {kenlm (default), berkeleylm} + + This determines the language model code that will be used when decoding. These implementations + are described in their respective papers (PDFs: + [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf), + [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). KenLM is written in + C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization. + +- `--lmfile FILE` + + Specifies a pre-built language model to use when decoding. This language model can be in ARPA + format, or in KenLM format when using KenLM or BerkeleyLM format when using that format. + +- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, `--witten-bell` + + At the tuning step, an LM is built from the target side of the training data (unless + `--no-corpus-lm` is specified). This controls which code is used to build it. The default is a + KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), and is strongly recommended. + + If SRILM is used, it is called with the following arguments: + + $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz + + Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline. + + [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java) + is also available. It computes a Kneser-Ney LM with a constant discounting (0.75) and no count + thresholding. The flag `--buildlm-mem` can be used to control how much memory is allocated to the + Java process. The default is "2g", but you will want to increase it for larger language models. + + A language model built from the target side of the training data is placed at `$RUNDIR/lm.gz`. + +## Interlude: decoder arguments + +Running the decoder is done in both the tuning stage and the testing stage. A critical point is +that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in +particular when decoding with large grammars and large language models. 
The default amount of +memory is 3100m, which is likely not enough (especially if you are decoding with SAMT grammar). You +can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is a Java +memory specification (passed to its `-Xmx` flag). + +## <a id="tuning" /> 6. TUNING + +Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`). If Moses is +installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner mira`, recommended). +Tuning is run till convergence in the `$RUNDIR/tune/N` directory, where N is the tuning instance. +By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary +number of times due to [recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the +variance of tuning procedures in machine translation, in particular MERT. This can be activated +with `--optimizer-runs N`. Each run can be found in a directory `$RUNDIR/tune/N`. + +When tuning is finished, each final configuration file can be found at either + + $RUNDIR/tune/N/joshua.config.final + +where N varies from 1..`--optimizer-runs`. + +## <a id="testing" /> 7. Testing + +For each of the tuner runs, Joshua takes the tuner output file and decodes the test set. If you +like, you can also apply minimum Bayes-risk decoding to the decoder output with `--mbr`. This +usually yields about 0.3 - 0.5 BLEU points, but is time-consuming. + +After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score, +writes it to `$RUNDIR/test/final-bleu`, and cats it. It also writes a file +`$RUNDIR/test/final-times` containing a summary of runtime information. That's the end of the pipeline! + +Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a +number of arguments: + +- `--first-step TEST` + + This tells the decoder to start at the test step. + +- `--name NAME` + + A name is needed to distinguish this test set from the previous ones. Output for this test run + will be stored at `$RUNDIR/test/NAME`. + +- `--joshua-config CONFIG` + + A tuned parameter file is required. This file will be the output of some prior tuning run. + Necessary pathnames and so on will be adjusted. + +## <a id="analysis"> 8. ANALYSIS + +If you have used the suggested layout, with a number of related runs all contained in a common +directory with sequential numbers, you can use the script `$JOSHUA/scripts/training/summarize.pl` to +display a summary of the mean BLEU scores from all runs, along with the text you placed in the run +README file (using the pipeline's `--readme TEXT` flag). + +## COMMON USE CASES AND PITFALLS + +- If the pipeline dies at the "thrax-run" stage with an error like the following: + + JOB FAILED (return code 1) + hadoop/bin/hadoop: line 47: + /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory + Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell + Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell + + This occurs if the `$HADOOP` environment variable is set but does not point to a working + Hadoop installation. To fix it, make sure to unset the variable: + + # in bash + unset HADOOP + + and then rerun the pipeline with the same invocation. + +- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In + particular, SAMT grammars often require a large amount of memory. 
Many steps have been taken to + reduce memory usage, including beam settings and test-set- and sentence-level filtering of + grammars. However, memory usage can still be in the tens of gigabytes. + + To accommodate this kind of variation, the pipeline script allows you to specify both (a) the + amount of memory used by the Joshua decoder instance and (b) the amount of memory required of + nodes obtained by the qsub command. These are accomplished with the `--joshua-mem` MEM and + `--qsub-args` ARGS commands. For example, + + pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ... + + Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB + from the Hadoop server. If more memory is needed, set the memory requirement with the + `--hadoop-mem` in the same way as the `--joshua-mem` option is used. + +- Other pitfalls and advice will be added as it is discovered. + +## FEEDBACK + +Please email [email protected] with problems or suggestions. + http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/server.md ---------------------------------------------------------------------- diff --git a/5.0/server.md b/5.0/server.md new file mode 100644 index 0000000..52b2a66 --- /dev/null +++ b/5.0/server.md @@ -0,0 +1,30 @@ +--- +layout: default +category: links +title: Server mode +--- + +The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs. + +Threading takes place both within and across requests. Threads from the decoder pool are assigned in round-robin manner across requests, preventing starvation. + + +# Invoking the server + +A running server is configured at invokation time. To start in server mode, run `joshua-decoder` with the option `-server-port [PORT]`. Additionally, the server can be configured in the same ways as when using the command-line-functionality. + +E.g., + + $JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10 + +## Using the server + +To test that the server is working, a set of inputs can be sent to the server from the command line. + +The server, as configured in the example above, will then respond to requests on port 10101. You can test it out with the `nc` utility: + + wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101 + +Since no model was loaded, this will just return the text to you as sent to the server. + +The `-server-port` option can also be used when creating a [bundled configuration](bundle.html) that will be run in server mode. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/thrax.md ---------------------------------------------------------------------- diff --git a/5.0/thrax.md b/5.0/thrax.md new file mode 100644 index 0000000..a904b23 --- /dev/null +++ b/5.0/thrax.md @@ -0,0 +1,14 @@ +--- +layout: default +category: advanced +title: Grammar extraction with Thrax +--- + +One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar +filtering, and details on the configuration file options. It will also include details about our +experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought +sweat and tears. 
+ +In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if there is something you +need to do that you don't understand. You might also be able to dig up some information [on the old +Thrax page](http://cs.jhu.edu/~jonny/thrax/). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/tms.md ---------------------------------------------------------------------- diff --git a/5.0/tms.md b/5.0/tms.md new file mode 100644 index 0000000..68f8732 --- /dev/null +++ b/5.0/tms.md @@ -0,0 +1,106 @@ +--- +layout: default +category: advanced +title: Building Translation Models +--- + +# Build a translation model + +Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences. + +We will copy (or symlink) the parallel source text files in a subdirectory called `input/`. + +Then, we concatenate all the training files on each side. The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step alignment`. + +* to tokenize the English data, do + + cat callhome.en europarl.en fisher.en > all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en + +The same can be done for the Spanish side of the input data: + + cat callhome.es europarl.es fisher.es > all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es + +By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project. + +The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line. + + paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ + | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en + +contents of `splittabls.pl` by Matt Post: + + #!/usr/bin/perl + + # splits on tab, printing respective chunks to the list of files given + # as script arguments + + use FileHandle; + + my @fh; + $| = 1; # don't buffer output + + if (@ARGV < 0) { + print "Usage: splittabs.pl < tabbed-file\n"; + exit; + } + + my @fh = map { get_filehandle($_) } @ARGV; + @ARGV = (); + + while (my $line = <>) { + chomp($line); + my (@fields) = split(/\t/,$line,scalar @fh); + + map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields); + } + + sub get_filehandle { + my $file = shift; + + if ($file eq "-") { + return *STDOUT; + } else { + local *FH; + open FH, ">$file" or die "can't open '$file' for writing"; + return *FH; + } + } + +Now we can run the pipeline to extract the grammar. Run the following script: + + #!/bin/bash + + # this creates a grammar + + # NEED: + # pair + # type + + set -u + + pair=es-en + type=hiero + + #. ~/.bashrc + + #basedir=$(pwd) + + dir=grammar-$pair-$type + + [[ ! 
-d $dir ]] && mkdir -p $dir + cd $dir + + source=$(echo $pair | cut -d- -f 1) + target=$(echo $pair | cut -d- -f 2) + + $JOSHUA/scripts/training/pipeline.pl \ + --source $source \ + --target $target \ + --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \ + --type $type \ + --joshua-mem 100g \ + --no-prepare \ + --first-step align \ + --last-step thrax \ + --hadoop $HADOOP \ + --threads 8 \ http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/tutorial.md ---------------------------------------------------------------------- diff --git a/5.0/tutorial.md b/5.0/tutorial.md new file mode 100644 index 0000000..038db9f --- /dev/null +++ b/5.0/tutorial.md @@ -0,0 +1,174 @@ +--- +layout: default +category: links +title: Pipeline tutorial +--- + +This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a +sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of +other options available in the pipeline. + +## Download and Setup + +Download and install Joshua as described on the [quick start page](index.html), installing it under +`~/code/`. Once you've done that, you should make sure you have the following environment variable set: + + export JOSHUA=$HOME/code/joshua-v5.0 + export JAVA_HOME=/usr/java/default + +If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua +will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also +install Moses, and define the environment variable `$MOSES` to point to the root of its installation. + +## A basic pipeline run + +For today's experiments, we'll be building a Bengali--English system using data included in the +[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking +the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's +[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is +not typically found in machine translation tutorials. + +Download the data and install it somewhere: + + cd ~/data + wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip + unzip indian-parallel-corpora.zip + +Then define the environment variable `$INDIAN` to point to it: + + cd ~/data/indian-parallel-corpora-master + export INDIAN=$(pwd) + +### Preparing the data + +Inside this tarball is a directory for each language pair. Within each language directory is another +directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was +done because the normalization scripts provided with Joshua is written in scripting languages that +often have problems properly handling UTF-8 character sets. We will be using these tokenized +versions, and preventing the pipeline from retokenizing using the `--no-prepare` flag. + +In `$INDIAN/bn-en/tok`, you should see the following files: + + $ ls $INDIAN/bn-en/tok + dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2 + dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3 + dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn + dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en + dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1 + +We will now use this data to test the complete pipeline with a single command. 
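Before launching the pipeline, it is worth confirming that each source-language file is
line-parallel with its English counterpart, since the pipeline expects the same number of lines on
both sides. A quick check, assuming the file layout listed above:

    wc -l $INDIAN/bn-en/tok/training.bn-en.bn $INDIAN/bn-en/tok/training.bn-en.en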
+ +### Run the pipeline + +Create an experiments directory for containing your first experiment: + + mkdir ~/expts/joshua + cd ~/expts/joshua + +We will now create the baseline run, using a particular directory structure for experiments that +will allow us to take advantage of scripts provided with Joshua for displaying the results of many +related experiments. + + cd ~/expts/joshua + $JOSHUA/bin/pipeline.pl \ + --rundir 1 \ + --readme "Baseline Hiero run" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --corpus $INDIAN/bn-en/tok/dict.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --lm-order 3 + +This will start the pipeline building a Bengali--English translation system constructed from the +training data and a dictionary, tuned against dev, and tested against devtest. It will use the +default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment, +KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state +minimization for representing LM state in the decoder, and so on. We change the order of the n-gram +model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM. + +A few notes: + +- This will likely take many hours to run, especially if you don't have a Hadoop cluster. + +- If you are running on Mac OS X, KenLM's `lmplz` will not build due to the absence of static + libraries. In that case, you should add the flag `--lm-gen srilm` (recommended, if SRILM is + installed) or `--lm-gen berkeleylm`. + +### Variations + +Once that is finished, you will have a baseline model. From there, you might wish to try variations +of the baseline model. Here are some examples of what you could vary: + +- Build an SAMT model (`--type samt`), GKHM model (`--type ghkm`), or phrasal ITG model (`--type phrasal`) + +- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`) + +- Build the language model with BerkeleyLM (`--lm-gen srilm`) instead of KenLM (the default) + +- Change the order of the LM from the default of 5 (`--lm-order 4`) + +- Tune with MIRA instead of MERT (`--tuner mira`). This requires that Moses is installed. + +- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100) + +- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`) + +To do this, we will create new runs that partially reuse the results of previous runs. This is +possible by doing two things: (1) incrementing the run directory and providing an updated README +note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) +providing the needed dependencies. + +# A second run + +Let's begin by changing the tuner, to see what effect that has. 
To do so, we change the run +directory, tell the pipeline to start at the tuning step, and provide the needed dependencies: + + $JOSHUA/bin/pipeline.pl \ + --rundir 2 \ + --readme "Tuning with MIRA" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --first-step tune \ + --tuner mira \ + --grammar 1/grammar.gz \ + --no-corpus-lm \ + --lmfile 1/lm.gz + + Here, we have essentially the same invocation, but we have told the pipeline to use a different + MIRA, to start with tuning, and have provided it with the language model file and grammar it needs + to execute the tuning step. + + Note that we have also told it not to build a language model. This is necessary because the + pipeline always builds an LM on the target side of the training data, if provided, but we are + supplying the language model that was already built. We could equivalently have removed the + `--corpus` line. + +## Changing the model type + +Let's compare the Hiero model we've already built to an SAMT model. We have to reextract the +grammar, but can reuse the alignments and the language model: + + $JOSHUA/bin/pipeline.pl \ + --rundir 3 \ + --readme "Baseline SAMT model" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --alignment 1/alignments/training.align \ + --first-step parse \ + --no-corpus-lm \ + --lmfile 1/lm.gz + +See [the pipeline script page](pipeline.html#steps) for a list of all the steps. + +## Analyzing the results + +We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them +using the `$JOSHUA/scripts/training/summarize.pl` script. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/zmert.md ---------------------------------------------------------------------- diff --git a/5.0/zmert.md b/5.0/zmert.md new file mode 100644 index 0000000..d6a5d3c --- /dev/null +++ b/5.0/zmert.md @@ -0,0 +1,83 @@ +--- +layout: default +category: advanced +title: Z-MERT +--- + +This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate +training module, written by Omar F. Zaidan. It is easily adapted to drop in different decoders, and +was also written so as to work with different objective functions (other than BLEU). + +((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section)) + +Z-MERT, can be used by launching the driver program (`ZMERT.java`), which expects a config file as +its main argument. This config file can be used to specify any subset of Z-MERT's 20-some +parameters. For a full list of those parameters, and their default values, run ZMERT with a single +-h argument as follows: + + java -cp $JOSHUA/bin joshua.zmert.ZMERT -h + +So what does a Z-MERT config file look like? + +Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`. 
You will find that it
specifies the following "main" MERT parameters:

    (*) -dir dirPrefix: working directory
    (*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset
    (*) -r refFile: target sentences (reference translations) of the MERT dataset
    (*) -rps refsPerSen: number of reference translations per sentence
    (*) -p paramsFile: file containing parameter names, initial values, and ranges
    (*) -maxIt maxMERTIts: maximum number of MERT iterations
    (*) -ipi initsPerIt: number of intermediate initial points per iteration
    (*) -cmd commandFile: name of file containing commands to run the decoder
    (*) -decOut decoderOutFile: name of the output file produced by the decoder
    (*) -dcfg decConfigFile: name of decoder config file
    (*) -N N: size of N-best list (per sentence) generated in each MERT iteration
    (*) -v verbosity: output verbosity level (0-2; higher value => more verbose)
    (*) -seed seed: seed used to initialize the random number generator

(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an
internal decoder. If Joshua is run as an external decoder, as is the case in
this README, then this parameter is ignored.)

To test Z-MERT on the 100-sentence test set of example2, provide this config
file to Z-MERT as follows:

    java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out

This will run Z-MERT for a couple of iterations on the data from the example2
folder. (Notice that we have made copies of the source and reference files
from example2 and renamed them as src.txt and ref.* in the MERT_example folder,
just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run
is complete, you should be able to inspect the log file to see what kinds of
things it did. If everything goes well, the run should take a few minutes, of
which more than 95% is time spent by Z-MERT waiting on Joshua to finish
decoding the sentences (once per iteration).

The output file you get should be equivalent to `ZMERT.out.verbosity1`. If you
rerun the experiment with the verbosity (`-v`) argument set to 2 instead of 1,
the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has
more interesting details about what Z-MERT does.

Notice the additional `-maxMem` argument. It tells Z-MERT that it should not
keep holding on to memory while the decoder is running (during which time Z-MERT
would be idle). The 500 tells Z-MERT that it can use a maximum of 500 MB.
For more details on this issue, see section (4) in Z-MERT's README.

A quick note about Z-MERT's interaction with the decoder: if you examine the
file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`)
argument in Z-MERT's config file, you'll find it contains the command one would
use to run the decoder. Z-MERT launches the commandFile as an external
process, and assumes that it will launch the decoder to produce translations.
After launching this external process, Z-MERT waits for it to finish, then uses
the resulting output file for parameter tuning (in addition to the output files
from previous iterations). The command file here has only a single command, but
your command file could have multiple lines. Just make sure the command file
itself is executable.
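
To make the relationship between these arguments concrete, here is a minimal
sketch of what such a command file might contain. This is not the actual
`decoder_command_ex2.txt` shipped with Joshua; the file names, paths, and
decoder flags below are purely illustrative:

    #!/bin/bash
    # Hypothetical command file, launched by Z-MERT once per iteration.
    # The decoder config named here should be the file given to Z-MERT as
    # -dcfg, the redirected output file should match -decOut, and the
    # decoder's top_n setting should equal Z-MERT's -N.
    $JOSHUA/bin/decoder -c joshua.config < src.txt > nbest.out 2> decoder.log
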

Notice that the Z-MERT arguments `decConfigFile` and `decoderOutFile` (`-dcfg` and
`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single
command. Also, the Z-MERT argument `-N` must match the value for `top_n` in
Joshua's config file, indicated by the Z-MERT argument decConfigFile (`-dcfg`).

For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/advanced.md ---------------------------------------------------------------------- diff --git a/6.0/advanced.md b/6.0/advanced.md new file mode 100644 index 0000000..4997e73 --- /dev/null +++ b/6.0/advanced.md @@ -0,0 +1,7 @@ +--- +layout: default6 +category: links +title: Advanced features +--- + +

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/bundle.md ---------------------------------------------------------------------- diff --git a/6.0/bundle.md b/6.0/bundle.md new file mode 100644 index 0000000..f433172 --- /dev/null +++ b/6.0/bundle.md @@ -0,0 +1,100 @@ +--- +layout: default6 +category: links +title: Building a language pack +--- +

*The information in this page applies to Joshua 6.0.3 and greater*.

Joshua distributes [language packs](/language-packs), which are models
that have been trained and tuned for particular language pairs. After
you have trained and tuned a model, you can easily create your own
language pack with the provided `$JOSHUA/scripts/support/run-bundler.py`
script, which gathers files from a pipeline training directory and
bundles them together for easy distribution and release.

The script takes just two mandatory arguments, in the following order:

1. The path to the Joshua configuration file to base the bundle
   on. This file should contain the tuned weights from the tuning run, so
   you can use either the final tuned file from the tuning run
   (`tune/joshua.config.final`) or the config file from the test run
   (`test/model/joshua.config`).
1. The directory to place the language pack in. If this directory
   already exists, the script will die, unless you also pass `--force`.

In addition, there are a number of other arguments that may be important.

- `--root /path/to/root`. If file paths in the Joshua config file are
  not absolute, you need to provide the relative root. If you specify a
  tuned pipeline file (such as `tune/joshua.config.final` above), the
  paths should all be absolute. If you instead provide a config file
  from a previous run bundle (e.g., `test/model/joshua.config`), the
  bundle directory above is the relative root.

- The config file options that are used in the pipeline are likely not
  the ones you want if you release a model. For example, the tuning
  configuration file contains options that tell Joshua to output 300
  translation candidates for each sentence (`-top-n 300`) and to
  include lots of detail about each translation (`-output-format '%i
  ||| %s ||| %f ||| %c'`). Because of this, you will want to tell the
  run bundler to change many of the config file options to be more
  geared towards human-readable output. The default copy-config
  options are `-top-n 0 -output-format %S -mark-oovs false`, which
  accomplish exactly this (human readability).

- A very important issue has to do with the translation model (the
  "TM", also sometimes called the grammar or phrase table). The
  translation model can be very large, so that it takes a long time to
  load and to [pack](packing.html).
  To reduce this time during model
  training, the translation model is filtered against the tuning and
  testing data in the pipeline, and these filtered models will be what
  is listed in the source config files. However, when exporting a
  model for use as a language pack, you need to export the full model
  instead of the filtered one so as to maximize your coverage on new
  test data. The `--tm` parameter is used to accomplish this; it takes
  an argument specifying the path to the full model. If you would
  additionally like the large model to be [packed](packing.html) (this
  is recommended; it reformats the TM so that it can be quickly loaded
  at run time), you can use `--pack-tm` instead. You can only pack one
  TM (but typically there is only one TM anyway). Multiple `--tm`
  parameters can be passed; they will replace TMs found in the config
  file in the order they are found.

Here is an example invocation for packing a hierarchical model using
the final tuned Joshua config file:

    ./run-bundler.py \
      --force --verbose \
      /path/to/rundir/tune/joshua.config.final \
      language-pack-YYYY-MM-DD \
      --root /path/to/rundir \
      --pack-tm /path/to/rundir/grammar.gz \
      --copy-config-options \
        '-top-n 0 -output-format %S -mark-oovs false' \
      --server-port 5674

The copy config options tell the decoder to present just the
single-best (`-top-n 0`) translated output string that has been
heuristically capitalized (`-output-format %S`), to not append `_OOV`
to OOVs (`-mark-oovs false`), and to use the translation model
`/path/to/rundir/grammar.gz` as the main translation model, packing it
before placing it in the bundle. Note that these arguments to
`--copy-config-options` are the default, so you could leave this option
off entirely. See [this page](decoder.html) for a longer list of
decoder options.

The following command is a slight variation used for phrase-based
models; it instead takes the test-set Joshua config (the result is the
same):

    ./run-bundler.py \
      --force --verbose \
      /path/to/rundir/test/model/joshua.config \
      --root /path/to/rundir/test/model \
      language-pack-YYYY-MM-DD \
      --pack-tm /path/to/rundir/model/phrase-table.gz \
      --server-port 5674

In both cases, a new directory `language-pack-YYYY-MM-DD` will be
created along with a README and a number of support files.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/decoder.md ---------------------------------------------------------------------- diff --git a/6.0/decoder.md b/6.0/decoder.md new file mode 100644 index 0000000..e8dc8c9 --- /dev/null +++ b/6.0/decoder.md @@ -0,0 +1,385 @@ +--- +layout: default6 +category: links +title: Decoder configuration parameters +--- +

Joshua configuration parameters affect the runtime behavior of the decoder itself. This page
lists the complete set of these parameters and describes how to invoke the decoder manually.

To run the decoder, a convenience script is provided that loads the necessary Java libraries.
Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation,
its syntax is:

    $JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...]

The `-m` argument, if present, must come first, and the memory specification is in Java format
(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and
"gigabytes", and there cannot be a space between the number and the unit. The value of this
argument is passed to Java itself in the invocation of the decoder, and the remaining options are
passed to Joshua. The `-c` parameter is particularly important because it specifies the location
of the configuration file.
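
Putting these pieces together, a typical manual invocation might look like the
following (the memory amount, paths, and file names here are illustrative):

    # Decode a file of tokenized sentences with a 4 GB heap, reading the model
    # from the given configuration file and sending the decoder's log to a file.
    cat input.txt | $JOSHUA/bin/decoder -m 4g -c /path/to/joshua.config > output.txt 2> decoder.log
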

The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are
received, according to a number of [output options](#output). If no run-time parameters are
specified (e.g., no translation model), sentences are simply pushed through untranslated. Blank
lines are similarly pushed through as blank lines, so as to maintain parallelism with the input.

Parameters can be provided to Joshua via a configuration file and from the command
line. Command-line arguments override values found in the configuration file. The format for
configuration file parameters is

    parameter = value

Command-line options are specified in the following format

    -parameter value

Values are one of four types (which we list here mostly to call attention to the boolean format):

- STRING, an arbitrary string (no spaces)
- FLOAT, a floating-point value
- INT, an integer
- BOOLEAN, a boolean value. For booleans, `true` evaluates to true, and all other values evaluate
  to false. For command-line options, the value may be omitted, in which case it evaluates to
  true. For example, the following are equivalent:

      $JOSHUA/bin/decoder -mark-oovs true
      $JOSHUA/bin/decoder -mark-oovs

## Joshua configuration file

In addition to the decoder parameters described below, the configuration file contains the model
feature weights. These weights are distinguished from runtime parameters in that they are delimited
by a space instead of an equals sign. They take the following
format, and by convention are placed at the end of the configuration file:

    lm_0 4.23
    tm_pt_0 -0.2
    OOVPenalty -100

Joshua can make use of thousands of features, which are described in further detail in the
[feature file](features.html).

## Joshua decoder parameters

This section contains a list of the Joshua run-time parameters. An important note about the
parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are
removed and case is converted to lowercase. For example, the following parameter forms are
equivalent (either in the configuration file or from the command line):

    {top-n, topN, top_n, TOP_N, t-o-p-N}
    {poplimit, pop-limit, pop_limit, popLimit, PoPlImIt}

This basically defines equivalence classes of parameters, and relieves you of the task of having to
remember the exact format of each parameter.

In what follows, we group the configuration parameters into the following groups:

- [General options](#general)
- [Pruning](#pruning)
- [Translation model options](#tm)
- [Language model options](#lm)
- [Output options](#output)
- [Alternate modes of operation](#modes)

<a id="general" />

### General decoder options

- `c`, `config` --- *NULL*

  Specifies the configuration file from which Joshua options are loaded. This feature is unique in
  that it must be specified from the command line (obviously).

- `amortize` --- *true*

  When true, specifies that sorting of the rule lists at each trie node in the grammar should be
  delayed until the trie node is accessed. When false, all such nodes are sorted before decoding
  even begins.
  Setting it to true results in slower per-sentence decoding, but allows the decoder to
  begin translating almost immediately (especially with large grammars).

- `server-port` --- *0*

  If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified
  port. Clients can connect to it directly through programming APIs or command-line tools like
  `telnet` or `nc`.

      $ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723
      ...
      $ cat input.txt | nc localhost 8723 > results.txt

- `maxlen` --- *200*

  Input sentences longer than this are truncated.

- `feature-function`

  Enables a particular feature function. See the [feature function page](features.html) for more
  information.

- `oracle-file` --- *NULL*

  The location of a set of oracle reference translations, parallel to the input. When present,
  after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the
  translation forest with a BLEU approximation in order to extract the oracle translation from the
  forest. This is useful for obtaining an (approximation to an) upper bound on your translation
  model under particular search settings.

- `default-nonterminal` --- *"X"*

  This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this
  label to every word of the input, in fact, so that even known words can be translated as OOVs, if
  the model prefers them. Usually, a very low weight on the `OOVPenalty` feature discourages their
  use unless necessary.

- `goal-symbol` --- *"GOAL"*

  This is the symbol whose presence in the chart over the whole input span denotes a successful
  parse (translation). It should match the LHS nonterminal in your glue grammar. Internally,
  Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can
  optionally supply in the configuration file.

- `true-oovs-only` --- *false*

  By default, Joshua creates an OOV entry for every word in the source sentence, regardless of
  whether it is found in the grammar. This allows every word to be pushed through untranslated
  (although potentially incurring a high cost based on the `OOVPenalty` feature). If this option is
  set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua
  examines the first level of the grammar trie for each word of the input (this isn't a perfect
  heuristic, since a word could be present only in deeper levels of the trie).

- `threads`, `num-parallel-decoders` --- *1*

  This determines how many simultaneous decoding threads to launch.

  Outputs are assembled in order, and Joshua has to hold on to the complete target hypergraph until
  it is ready to be processed for output, so too many simultaneous threads could result in lots of
  memory usage if a long sentence causes many subsequent sentences to be queued up. We have run
  Joshua with as many as 64 threads without any problems of this kind, but it's useful to keep this
  in the back of your mind.

- `weights-file` --- *NULL*

  Weights are appended to the end of the Joshua configuration file, by convention. If you prefer to
  put them in a separate file, you can do so, and point to the file with this parameter.

### Pruning options <a id="pruning" />

- `pop-limit` --- *100*

  The number of cube-pruning hypotheses that are popped from the candidates list for each span of
  the input.
  Higher values result in a larger portion of the search space being explored at the
  cost of an increased search time. For exhaustive search, set `pop-limit` to 0.

- `filter-grammar` --- *false*

  Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is
  filtered at runtime down to rules that can be applied to the sentence under consideration. This
  takes some time (which we haven't thoroughly quantified), but can result in the removal of many
  rules that are only partially applicable to the sentence.

- `constrain-parse` --- *false*
- `use_pos_labels` --- *false*

  *These features are not documented.*

### Translation model options <a id="tm" />

Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar
containing translation rules, and the glue grammar for patching things together. Internally, Joshua
doesn't distinguish between the roles of these grammars; they are treated differently only in that
they typically have different span limits (the maximum input width they can be applied to).

Grammars are instantiated with config file lines of the following form:

    tm = TYPE OWNER SPAN_LIMIT FILE

* `TYPE` is the grammar type, which must be set to "thrax".
* `OWNER` is the grammar's owner, which defines the set of [feature weights](features.html) that
  apply to the weights found in each line of the grammar (using different owners allows each grammar
  to have different sets and numbers of weights, while sharing owners allows weights to be shared
  across grammars).
* `SPAN_LIMIT` is the maximum span of the input that rules from this grammar can be applied to. A
  span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must
  be anchored to the left side of the sentence (index 0).
* `FILE` is the path to the file containing the grammar. If the file is a directory, it is assumed
  to be [packed](packing.html). Only one packed grammar can currently be used at a time.

For reference, the following two translation model lines are used by the [pipeline](pipeline.html):

    tm = thrax pt 20 /path/to/packed/grammar
    tm = thrax glue -1 /path/to/glue/grammar

### Language model options <a id="lm" />

Joshua supports any number of language models. With Joshua 6.0, these
are just regular feature functions:

    feature-function = LanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
    feature-function = StateMinimizingLanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE

`LanguageModel` is a generic language model, supporting types 'kenlm'
(the default) and 'berkeleylm'. `StateMinimizingLanguageModel`
implements LM state minimization to reduce the size of context n-grams
where appropriate
([Li and Khudanpur, 2008](http://www.aclweb.org/anthology/W08-0402.pdf);
[Heafield et al., 2013](https://aclweb.org/anthology/N/N13/N13-1116.pdf)). This
is currently only supported by KenLM, so the `-lm_type` option is not
available here.

The other key/value pairs are defined as follows:

* `lm_type`: one of "kenlm" or "berkeleylm"
* `lm_order`: the order of the language model
* `lm_file`: the path to the language model file. All language model
  types support the standard ARPA format. Additionally, if the LM
  type is "kenlm", this file can be compiled into KenLM's binary
  format (using the program at `$JOSHUA/bin/build_binary`); if the
  LM type is "berkeleylm", it can be compiled by following the
  directions in `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The
  [pipeline](pipeline.html) will automatically compile either type.

For each language model, you need to specify a feature weight in the following format:

    lm_0 WEIGHT
    lm_1 WEIGHT
    ...

where the indices correspond to the order of the language model declaration lines.
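
As an illustration, a configuration file that loads two hypothetical language
models and weights them might contain the following lines (the file paths,
orders, and weight values are made up for this example):

    feature-function = LanguageModel -lm_file /path/to/gigaword.kenlm -lm_order 5 -lm_type kenlm
    feature-function = StateMinimizingLanguageModel -lm_file /path/to/europarl.kenlm -lm_order 5

    lm_0 1.0
    lm_1 0.5

Here `lm_0` refers to the first declaration (the `LanguageModel`) and `lm_1` to
the second (the `StateMinimizingLanguageModel`), since weights are matched to
language models in the order they are declared.
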

### Output options <a id="output" />

- `output-format` *New in 5.0*

  Joshua prints a lot of information to STDERR (making this more granular is on the TODO
  list). Output to STDOUT is reserved for decoder translations, and is controlled by the
  `output-format` parameter, which can contain any of the following format specifiers:

  - `%i`: the sentence number (0-indexed)

  - `%e`: the source sentence

  - `%s`: the translated sentence

  - `%S`: the translated sentence, with some basic capitalization and denormalization, e.g.,

        $ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null
        ¿Who you lookin' at, Mr.?

  - `%t`: the target-side tree projection, all printed on one line (PTB style)

  - `%d`: the synchronous derivation, with each rule printed indented on its own line

  - `%f`: the list of feature values (as name=value pairs)

  - `%c`: the model cost

  - `%w`: the weight vector (unimplemented)

  - `%a`: the alignments between source and target words (currently broken for hierarchical mode)

  The default value is

      output-format = %i ||| %s ||| %f ||| %c

  i.e.,

      input ID ||| translation ||| model scores ||| score

- `top-n` --- *300*

  The number of translation hypotheses to output, sorted in decreasing order of model score.

- `use-unique-nbest` --- *true*

  When constructing the n-best list for a sentence, skip hypotheses whose string has already been
  output.

- `escape-trees` --- *false*

- `include-align-index` --- *false*

  Output the source word indices that each target word aligns to.

- `mark-oovs` --- *false*

  If `true`, this causes the text "_OOV" to be appended to each untranslated word in the output.

- `visualize-hypergraph` --- *false*

  If set to true, a visualization of the hypergraph will be displayed, though you will have to
  explicitly include the relevant jar files. See the example usage in
  `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence,
  translation, and synchronous derivation.

- `dump-hypergraph` --- ""

  This feature directs that the hypergraph should be written to disk for each input sentence. If
  set, the value should contain the string "%d", which is replaced with the sentence number. For
  example,

      cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt

  Note that the output directory must exist.

  TODO: revive the
  [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format)
  on the ACL Wiki and support that format.

### Lattice decoding

In addition to regular sentences, Joshua can decode weighted lattices encoded in
[the PLF format](http://www.statmt.org/moses/?n=Moses.WordLattices), except that path costs should
be listed as <b>log probabilities</b> instead of probabilities. Lattice decoding was originally
added by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
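
For instance, a two-word lattice whose first word is ambiguous between two
alternatives could be written on a single input line roughly as follows (the
words and scores are illustrative; see the Moses page linked above for the
authoritative description of the PLF syntax):

    ((('hello', -0.693, 1), ('howdy', -0.693, 1),), (('world', 0.0, 1),),)

Each triple lists a word, its score (here a log probability), and the offset of
the lattice node that the edge points to.
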

Joshua will automatically detect whether the input sentence is a regular sentence (the usual case)
or a lattice. If a lattice, a feature will be activated that accumulates the cost of different
paths through the lattice. In this case, you need to ensure that a weight for this feature is
present in [your model file](decoder.html). The [pipeline](pipeline.html) will handle this
automatically, or if you are doing this manually, you can add the line

    SourcePath COST

to your Joshua configuration file.

Lattices must be listed one per line.

### Alternate modes of operation <a id="modes" />

In addition to decoding input sentences in the standard way, Joshua supports both *constrained
decoding* and *synchronous parsing*. In both settings, the source and target sides are provided
as input, and the decoder finds a derivation between them.

#### Constrained decoding

To enable constrained decoding, simply append the desired target string as part of the input, in
the following format:

    source sentence ||| target sentence

Joshua will translate the source sentence constrained to the target sentence. There are a few
caveats:

  * Left-state minimization cannot be enabled for the language model

  * A heuristic is used to constrain the derivation (the LM state must match against the
    input). This is not a perfect heuristic, and sometimes results in analyses that are not
    perfectly constrained to the input, but have extra words.

#### Synchronous parsing

Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in
Dyer (NAACL 2010) ([PDF](http://www.aclweb.org/anthology/N10-1033.pdf)). To enable this:

  - Set the configuration parameter `parse = true`.

  - Remove all language models from the configuration file.

  - Provide input in the following format:

        source sentence ||| target sentence

You may also wish to display the synchronous parse tree (`-output-format %t`) and the alignment
(`-include-align-index`).
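
Putting this together, a hypothetical synchronous parsing invocation might look
like the following, using the command-line form of the `parse` parameter
described above (the sentence pair and config path are illustrative):

    # The config file referenced here is assumed to have had all of its
    # language models removed, as described above.
    echo "el gato negro ||| the black cat" | \
      $JOSHUA/bin/decoder -c /path/to/joshua.config -parse true -output-format %t
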