minor updates
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/4b3bdd31
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/4b3bdd31
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/4b3bdd31

Branch: refs/heads/master
Commit: 4b3bdd31023372b769d9f3bd60f01365503a9e0b
Parents: 601d9f8
Author: Matt Post <[email protected]>
Authored: Mon Jun 22 22:17:24 2015 -0400
Committer: Matt Post <[email protected]>
Committed: Mon Jun 22 22:17:24 2015 -0400

----------------------------------------------------------------------
 6.0/pipeline.md | 58 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 13 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/4b3bdd31/6.0/pipeline.md
----------------------------------------------------------------------
diff --git a/6.0/pipeline.md b/6.0/pipeline.md
index f35f618..35d408d 100644
--- a/6.0/pipeline.md
+++ b/6.0/pipeline.md
@@ -35,7 +35,7 @@ The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`,
 the user to define arbitrary execution dependency graphs. However, it is significantly simpler to
 use, allowing many systems to be built with a single command (that may run for days or weeks).
 
-## Installation
+## Dependencies
 
 The pipeline has no *required* external dependencies. However, it has support for a number of
 external packages, some of which are included with Joshua.
@@ -67,7 +67,11 @@ external packages, some of which are included with Joshua.
   standalone installation and use it to extract your grammar. This behavior will be triggered if
   `$HADOOP` is undefined.
 
-- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included)
+- [Moses](http://statmt.org/moses/) (not included). Moses is needed
+  if you wish to use its 'kbmira' tuner (--tuner kbmira), or if you
+  wish to build phrase-based models.
+
+- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included; not needed; not recommended)
 
   By default, the pipeline uses the included [KenLM](https://kheafield.com/code/kenlm/) for
   building (and also querying) language models. Joshua also includes a Java program from the
@@ -83,9 +87,9 @@ external packages, some of which are included with Joshua.
   having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and
   BerkeleyLM).
 
-- [Moses](http://statmt.org/moses/) (not included)
-
-Make sure that the environment variable `$JOSHUA` is defined, and you should be all set.
+After installing any dependencies, follow the brief instructions on
+the [installation page](install.html), and then you are ready to build
+models.
 
 ## A basic pipeline run
 
@@ -124,6 +128,8 @@ Running the pipeline requires two main steps: data preparation and invocation.
 1. Run the pipeline. The following is the minimal invocation to run the complete pipeline:
 
         $JOSHUA/bin/pipeline.pl \
+          --rundir . \
+          --type hiero \
           --corpus input/train \
           --tune input/tune \
           --test input/devtest \
@@ -158,7 +164,8 @@ producing BLEU scores at the end. As it runs, you will see output that looks li
         took 0 seconds (0s)
     ...
 
-And in the current directory, you will see the following files (among other intermediate files
+And in the current directory, you will see the following files (among
+other files, including intermediate files
 generated by the individual sub-steps).
 
     data/
@@ -179,6 +186,8 @@ generated by the individual sub-steps).
     alignments/
         0/
             [giza/berkeley aligner output files]
+        1/
+        ...
         training.align
     thrax-hiero.conf
     thrax.log
@@ -193,12 +202,21 @@ generated by the individual sub-steps).
         mert.log
         joshua.config.final
         final-bleu
+    test/
+        model/
+            [model files]
+        output
+        final-bleu
 
 These files will be described in more detail in subsequent sections of this tutorial.
 
 Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before
 running the pipeline. By default the rundir is the current directory. Changing it can be useful
-for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus`
+for organizing related pipeline runs. In fact, we highly recommend
+that you organize your runs using consecutive integers, also taking a
+minute to pass a short note with the `--readme` flag, which allows you
+to quickly generate reports on [groups of related experiments](#managing).
+Relative paths specified to other flags (e.g., to `--corpus`
 or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself
 (unless they happen to be the same, of course).
 
@@ -217,7 +235,7 @@ of traditional pipeline tasks:
 These steps are discussed below, after a few intervening sections about high-level details of the
 pipeline.
 
-## Managing groups of experiments
+## <a id="managing" /> Managing groups of experiments
 
 The real utility of the pipeline comes when you use it to manage groups of experiments. Typically,
 there is a held-out test set, and we want to vary a number of training parameters to determine what
@@ -225,7 +243,7 @@ effect this has on BLEU scores or some other metric. Joshua comes with a script
 `$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports
 them to you. This script works so long as you organize your runs as follows:
 
-1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`.
+1. Your runs should be grouped together in a root directory, which I'll call `$EXPDIR`.
 
 2. For comparison purposes, the runs should all be evaluated on the same test set.
 
@@ -241,6 +259,10 @@ the summarize script:
             [other files]
         2/
             README.txt
+            test/
+                final-bleu
+                final-times
+            [other files]
         ...
 
 You can get such directories using the `--rundir N` flag to the pipeline.
@@ -259,9 +281,11 @@ More details are below.
 
 ## Grammar options
 
-Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described
-on the [file formats page](file-formats.html), all of them are encoded into the same file format,
-but they differ in terms of the richness of their nonterminal sets.
+Hierarchical Joshua can extract three types of grammars: Hiero
+grammars, GHKM, and SAMT grammars. As described on the
+[file formats page](file-formats.html), all of them are encoded into
+the same file format, but they differ in terms of the richness of
+their nonterminal sets.
 
 Hiero grammars make use of a single nonterminals, and are extracted by computing phrases from
 word-based alignments and then subtracting out phrase differences. More detail can be found in
@@ -280,6 +304,12 @@ By default, the Joshua pipeline extract a Hiero grammar, but this can be altered
 but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only
 outputs two features, so the scores tend to be significantly lower than that of Moses'.
 
+Joshua (new in version 6) also includes an unlexicalized phrase-based
+decoder. Building a phrase-based model requires you to have Moses
+installed, since its `train-model.perl` script is used to extract the
+phrase table. You can enable this by defining the `$MOSES` environment
+variable and then specifying `--type phrase`.
+
 ## Other high-level options
 
 The following command-line arguments control run-time behavior of multiple steps:
@@ -294,7 +324,9 @@ The following command-line arguments control run-time behavior of multiple steps
   This enables parallel operation over a cluster using the qsub command. This feature is not
   well-documented at this point, but you will likely want to edit the file
   `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to setup your qsub environment, and may also
-  want to pass specific qsub commands via the `--qsub-args "ARGS"` command.
+  want to pass specific qsub commands via the `--qsub-args "ARGS"`
+  command. We suggest you stick to the standard Joshua model that
+  tries to use as many cores as are available with the `--threads N` option.
 
 ## Restarting failed runs
 
