http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/large-lms.md
----------------------------------------------------------------------
diff --git a/5.0/large-lms.md b/5.0/large-lms.md
new file mode 100644
index 0000000..28ba0b9
--- /dev/null
+++ b/5.0/large-lms.md
@@ -0,0 +1,192 @@
---
layout: default
title: Building large LMs with SRILM
category: advanced
---

The following is a tutorial for building a large language model from the
English Gigaword Fifth Edition corpus
[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
using SRILM. English text is provided from seven different sources.

### Step 0: Clean up the corpus

The Gigaword corpus has to be stripped of all SGML tags and tokenized.
Instructions for performing those steps are not included in this
documentation. A description of this process can be found in a paper called
["Annotated Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).

The Joshua package ships with a script that converts all alphabetical
characters to their lowercase equivalent. The script is located at
`$JOSHUA/scripts/lowercase.perl`.

Make a directory structure as follows:

    gigaword/
    ├── corpus/
    │   ├── afp_eng/
    │   │   ├── afp_eng_199405.lc.gz
    │   │   ├── afp_eng_199406.lc.gz
    │   │   ├── ...
    │   │   └── counts/
    │   ├── apw_eng/
    │   │   ├── apw_eng_199411.lc.gz
    │   │   ├── apw_eng_199412.lc.gz
    │   │   ├── ...
    │   │   └── counts/
    │   ├── cna_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── ltw_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── nyt_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   ├── wpb_eng/
    │   │   ├── ...
    │   │   └── counts/
    │   └── xin_eng/
    │       ├── ...
    │       └── counts/
    └── lm/
        ├── afp_eng/
        ├── apw_eng/
        ├── cna_eng/
        ├── ltw_eng/
        ├── nyt_eng/
        ├── wpb_eng/
        └── xin_eng/

The next step will be to build smaller LMs and then interpolate them into one
file.

### Step 1: Count ngrams

Run the following script once from each source directory under the `corpus/`
directory (edit it to specify the path to the `ngram-count` binary as well as
the number of processors):

    #!/bin/sh

    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
    args=""

    for source in *.gz; do
       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
    done

    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT

Then move each `counts/` directory to the corresponding directory under
`lm/`. Now that each ngram has been counted, we can make a language
model for each of the seven sources.

### Step 2: Make individual language models

SRILM includes a script, called `make-big-lm`, for building large language
models under resource-limited environments. The manual for this script can be
read online
[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
even in environments with many parallel processors and a lot of memory.
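Before running `make-big-lm`, make sure the per-source `counts/` directories produced in Step 1
have been moved from `corpus/` to the corresponding directories under `lm/`. A minimal sketch of
that move (assuming the directory layout shown above; adjust paths to your setup):

    # run from the top-level gigaword/ directory
    for d in afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng; do
        mv corpus/$d/counts lm/$d/counts
    done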
Initiate the following script from each of the source directories under the
`lm/` directory (edit it to specify the path to the `make-big-lm` script as
well as the pruning threshold):

    #!/bin/bash
    set -x

    CMD=$SRILM_SRC/bin/make-big-lm
    PRUNE_THRESHOLD=1e-8

    $CMD \
      -name gigalm `for k in counts/*.gz; do echo " \
      -read $k "; done` \
      -lm lm.gz \
      -max-per-file 100000000 \
      -order 5 \
      -kndiscount \
      -interpolate \
      -unk \
      -prune $PRUNE_THRESHOLD

The language model attributes chosen are the following:

* N-grams up to order 5
* Kneser-Ney smoothing
* N-gram probability estimates at the specified order *n* are interpolated with
  lower-order estimates
* the unknown-word token is included as a regular word
* N-grams are pruned according to the specified threshold

Next, we will mix the models together into a single file.

### Step 3: Mix models together

Using development text, interpolation weights can be determined that give the
highest weight to the source language models with the lowest perplexity on the
specified development set.

#### Step 3-1: Determine interpolation weights

Initiate the following script from the `lm/` directory (edit it to specify the
path to the `ngram` binary as well as the path to the development text file):

    #!/bin/bash
    set -x

    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es

    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )

    for d in ${dirs[@]} ; do
      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
    done

    compute-best-mix */lm.ppl > best-mix.ppl

Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
values in parentheses. These are the interpolation weights of the source
language models, in the order specified. Copy and paste the values within the
parentheses into the script below.

#### Step 3-2: Combine the models

Initiate the following script from the `lm/` directory (edit it to specify the
path to the `ngram` binary as well as the interpolation weights):

    #!/bin/bash
    set -x

    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
    DIRS=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)

    $NGRAM -order 5 -unk \
      -lm ${DIRS[0]}/lm.gz -lambda ${LAMBDAS[0]} \
      -mix-lm ${DIRS[1]}/lm.gz \
      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
      -write-lm mixed_lm.gz

The resulting file, `mixed_lm.gz`, is a language model based on all the text in
the Gigaword corpus, with some probabilities biased toward the development text
specified in Step 3-1. It is in the ARPA format. The optional next step converts
it into the KenLM format.

#### Step 3-3: Convert to KenLM

The KenLM format has some speed advantages over the ARPA format. Issuing the
following command writes a new language model file, `mixed_lm.kenlm`, which is
the `mixed_lm.gz` language model transformed into the KenLM format.

    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
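As an optional sanity check, either before or after the conversion, you can compute the mixed
model's perplexity on the development text used in Step 3-1; it should generally be no higher than
that of the best individual source model. A sketch reusing the `$NGRAM` and `$DEV_TEXT` variables
from the scripts above:

    $NGRAM -order 5 -unk -lm mixed_lm.gz -ppl $DEV_TEXT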
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/packing.md
----------------------------------------------------------------------
diff --git a/5.0/packing.md b/5.0/packing.md
new file mode 100644
index 0000000..2f39ba7
--- /dev/null
+++ b/5.0/packing.md
@@ -0,0 +1,76 @@
---
layout: default
category: advanced
title: Grammar Packing
---

Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
efficiently encoding it for use by Joshua. Packing the grammar results in significantly faster load
times for very large grammars.

Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
automatically, and we will provide a script that automates these steps for you.

1. Make sure the grammar is labeled. A labeled grammar is one that has feature names attached to
each of the feature values in each row of the grammar file. Here is a line from an unlabeled
grammar:

        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184

    and here is one from a labeled grammar (note that the labels are not very useful):

        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184

    If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:

        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py > grammar-labeled.gz

    A side effect of this step is that it produces a file `dense_map` in the current directory,
    containing the mapping between feature names and feature columns. This file is needed in later
    steps.

1. The packer needs a sorted grammar. It is sufficient to sort by the first word:

        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz

    (The reason we need a sorted grammar is that the packer stores the grammar in a trie. The
    pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
    grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
    lookup.)

1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
   and (2) a dense map file.

    1. Write a packer config file. This file specifies items such as the chunk size (for the packed
       pieces) and the quantization classes and types for each feature name. Examples can be found
       at

           $JOSHUA/test/packed/packer.config
           $JOSHUA/test/bn-en/packed/packer.quantized
           $JOSHUA/test/bn-en/packed/packer.uncompressed

       The quantizer lines in the packer config file have the following format:

           quantizer TYPE FEATURES

       where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a
       space-delimited list of feature names that have that quantization type.

    1. Write a dense_map file. If you labeled an unlabeled grammar, this was produced for you as a
       side product of the `label_grammar.py` script you called in Step 1. Otherwise, you need to
       create a file that lists the mapping between feature names and (0-indexed) columns in the
       grammar, one per line, in the following format:

           feature-index feature-name

1. To pack the grammar, type the following command:

        java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE

    This will read in your packer configuration file and your grammar, and produce a packed grammar
    in the output directory.

1.
To use the packed grammar, just point to the packed directory in your Joshua configuration file. + + tm-file = packed-grammar/ + tm-format = packed http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/pipeline.md ---------------------------------------------------------------------- diff --git a/5.0/pipeline.md b/5.0/pipeline.md new file mode 100644 index 0000000..fbe052d --- /dev/null +++ b/5.0/pipeline.md @@ -0,0 +1,640 @@ +--- +layout: default +category: links +title: The Joshua Pipeline +--- + +This page describes the Joshua pipeline script, which manages the complexity of training and +evaluating machine translation systems. The pipeline eases the pain of two related tasks in +statistical machine translation (SMT) research: + +- Training SMT systems involves a complicated process of interacting steps that are + time-consuming and prone to failure. + +- Developing and testing new techniques requires varying parameters at different points in the + pipeline. Earlier results (which are often expensive) need not be recomputed. + +To facilitate these tasks, the pipeline script: + +- Runs the complete SMT pipeline, from corpus normalization and tokenization, through alignment, + model building, tuning, test-set decoding, and evaluation. + +- Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the + pipeline can be debugged or shared across similar runs while doing away with time spent + recomputing expensive steps. + +- Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment + stage), so long as you provide the missing dependencies. + +The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of +its features. It is not as extensive, however, as Moses' +[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows +the user to define arbitrary execution dependency graphs. + +## Installation + +The pipeline has no *required* external dependencies. However, it has support for a number of +external packages, some of which are included with Joshua. + +- [GIZA++](http://code.google.com/p/giza-pp/) (included) + + GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when + you typed `ant` from the Joshua root directory. It is not required because you can use the + (included) Berkeley aligner (`--aligner berkeley`). We have recently also provided support + for the [Jacana-XY aligner](http://code.google.com/p/jacana-xy/wiki/JacanaXY) (`--aligner + jacana`). + +- [Hadoop](http://hadoop.apache.org/) (included) + + The pipeline uses the [Thrax grammar extractor](thrax.html), which is built on Hadoop. If you + have a Hadoop installation, simply ensure that the `$HADOOP` environment variable is defined, and + the pipeline will use it automatically at the grammar extraction step. If you are going to + attempt to extract very large grammars, it is best to have a good-sized Hadoop installation. + + (If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be + installed in a + ["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed) + mode that allows it to use just a few machines or a number of processors on a single machine. 
+ The main issue is to ensure that there are a lot of independent physical disks, since in our + experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on + the disks.) + + If you don't have a Hadoop installation, there are still no worries. The pipeline will unroll a + standalone installation and use it to extract your grammar. This behavior will be triggered if + `$HADOOP` is undefined. + +- [SRILM](http://www.speech.sri.com/projects/srilm/) (not included) + + By default, the pipeline uses a Java program from the + [Berkeley LM](http://code.google.com/p/berkeleylm/) package that constructs an + Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If + you wish to use SRILM instead, you need to do the following: + + 1. Install SRILM and set the `$SRILM` environment variable to point to its installed location. + 1. Add the `--lm-gen srilm` flag to your pipeline invocation. + + More information on this is available in the [LM building section of the pipeline](#lm). SRILM + is not used for representing language models during decoding (and in fact is not supported, + having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the default) and + BerkeleyLM). + +- [Moses](http://statmt.org/moses/) (not included) + +Make sure that the environment variable `$JOSHUA` is defined, and you should be all set. + +## A basic pipeline run + +The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of +intermediate files in the *run directory*. By default, the run directory is the current directory, +but it can be changed with the `--rundir` parameter. + +For this quick start, we will be working with the example that can be found in +`$JOSHUA/examples/pipeline`. This example contains 1,000 sentences of Urdu-English data (the full +dataset is available as part of the +[Indian languages parallel corpora](/indian-parallel-corpora/) with +100-sentence tuning and test sets with four references each. + +Running the pipeline requires two main steps: data preparation and invocation. + +1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning, + and test data. A good convention is to place these files in an input/ subdirectory of your run's + working directory (NOTE: do not use `data/`, since a directory of that name is created and used + by the pipeline itself for storing processed files). The expected format (for each of training, + tuning, and test) is a pair of files that share a common path prefix and are distinguished by + their extension, e.g., + + input/ + train.SOURCE + train.TARGET + tune.SOURCE + tune.TARGET + test.SOURCE + test.TARGET + + These files should be parallel at the sentence level (with one sentence per line), should be in + UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote + variables that should be replaced with the actual target and source language abbreviations (e.g., + "ur" and "en"). + +1. Run the pipeline. The following is the minimal invocation to run the complete pipeline: + + $JOSHUA/bin/pipeline.pl \ + --corpus input/train \ + --tune input/tune \ + --test input/devtest \ + --source SOURCE \ + --target TARGET + + The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatened with the + language extensions given by `--target` and `--source` (with a "." in between). Note the + correspondences with the files defined in the first step above. 
The prefixes can be either + absolute or relative pathnames. This particular invocation assumes that a subdirectory `input/` + exists in the current directory, that you are translating from a language identified "ur" + extension to a language identified by the "en" extension, that the training data can be found at + `input/train.en` and `input/train.ur`, and so on. + +*Don't* run the pipeline directly from `$JOSHUA`. We recommend creating a run directory somewhere + else to contain all of your experiments in some other location. The advantage to this (apart from + not clobbering part of the Joshua install) is that Joshua provides support scripts for visualizing + the results of a series of experiments that only work if you + +Assuming no problems arise, this command will run the complete pipeline in about 20 minutes, +producing BLEU scores at the end. As it runs, you will see output that looks like the following: + + [train-copy-en] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.en + dep=data/train/train.en.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz + took 0 seconds (0s) + [train-copy-ur] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.ur + dep=data/train/train.ur.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz + took 0 seconds (0s) + ... + +And in the current directory, you will see the following files (among other intermediate files +generated by the individual sub-steps). + + data/ + train/ + corpus.ur + corpus.en + thrax-input-file + tune/ + tune.tok.lc.ur + tune.tok.lc.en + grammar.filtered.gz + grammar.glue + test/ + test.tok.lc.ur + test.tok.lc.en + grammar.filtered.gz + grammar.glue + alignments/ + 0/ + [giza/berkeley aligner output files] + training.align + thrax-hiero.conf + thrax.log + grammar.gz + lm.gz + tune/ + 1/ + decoder_command + joshua.config + params.txt + joshua.log + mert.log + joshua.config.ZMERT.final + final-bleu + +These files will be described in more detail in subsequent sections of this tutorial. + +Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before +running the pipeline. By default the rundir is the current directory. Changing it can be useful +for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus` +or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself +(unless they happen to be the same, of course). + +The complete pipeline comprises many tens of small steps, which can be grouped together into a set +of traditional pipeline tasks: + +1. [Data preparation](#prep) +1. [Alignment](#alignment) +1. [Parsing](#parsing) (syntax-based grammars only) +1. [Grammar extraction](#tm) +1. [Language model building](#lm) +1. [Tuning](#tuning) +1. [Testing](#testing) +1. [Analysis](#analysis) + +These steps are discussed below, after a few intervening sections about high-level details of the +pipeline. + +## Managing groups of experiments + +The real utility of the pipeline comes when you use it to manage groups of experiments. Typically, +there is a held-out test set, and we want to vary a number of training parameters to determine what +effect this has on BLEU scores or some other metric. Joshua comes with a script +`$JOSHUA/scripts/training/summarize.pl` that collects information from a group of runs and reports +them to you. 
This script works so long as you organize your runs as follows: + +1. Your runs should be grouped together in a root directory, which I'll call `$RUNDIR`. + +2. For comparison purposes, the runs should all be evaluated on the same test set. + +3. Each run in the run group should be in its own numbered directory, shown with the files used by +the summarize script: + + $RUNDIR/ + 1/ + README.txt + test/ + final-bleu + final-times + [other files] + 2/ + README.txt + ... + +You can get such directories using the `--rundir N` flag to the pipeline. + +Run directories can build off each other. For example, `1/` might contain a complete baseline +run. If you wanted to just change the tuner, you don't need to rerun the aligner and model builder, +so you can reuse the results by supplying the second run with the information it needs that was +computed in step 1: + + $JOSHUA/bin/pipeline.pl \ + --first-step tune \ + --grammar 1/grammar.gz \ + ... + +More details are below. + +## Grammar options + +Joshua can extract three types of grammars: Hiero grammars, GHKM, and SAMT grammars. As described +on the [file formats page](file-formats.html), all of them are encoded into the same file format, +but they differ in terms of the richness of their nonterminal sets. + +Hiero grammars make use of a single nonterminals, and are extracted by computing phrases from +word-based alignments and then subtracting out phrase differences. More detail can be found in +[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201). +[GHKM](http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf) (new with 5.0) and +[SAMT](http://www.cs.cmu.edu/~zollmann/samt/) grammars make use of a source- or target-side parse +tree on the training data, differing in the way they extract rules using these trees: GHKM extracts +synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas +SAMT projects constituent labels down onto phrases. SAMT grammars are usually many times larger and +are much slower to decode with, but sometimes increase BLEU score. Both grammar formats are +extracted with the [Thrax software](thrax.html). + +By default, the Joshua pipeline extract a Hiero grammar, but this can be altered with the `--type +(ghkm|samt)` flag. For GHKM grammars, the default is to use +[Michel Galley's extractor](http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz), +but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs +two features, so the scores tend to be significantly lower than that of Moses'. + +## Other high-level options + +The following command-line arguments control run-time behavior of multiple steps: + +- `--threads N` (1) + + This enables multithreaded operation for a number of steps: alignment (with GIZA, max two + threads), parsing, and decoding (any number of threads) + +- `--jobs N` (1) + + This enables parallel operation over a cluster using the qsub command. This feature is not + well-documented at this point, but you will likely want to edit the file + `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to setup your qsub environment, and may also + want to pass specific qsub commands via the `--qsub-args "ARGS"` command. + +## Restarting failed runs + +If the pipeline dies, you can restart it with the same command you used the first time. 
If you +rerun the pipeline with the exact same invocation as the previous run (or an overlapping +configuration -- one that causes the same set of behaviors), you will see slightly different +output compared to what we saw above: + + [train-copy-en] cached, skipping... + [train-copy-ur] cached, skipping... + ... + +This indicates that the caching module has discovered that the step was already computed and thus +did not need to be rerun. This feature is quite useful for restarting pipeline runs that have +crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that +plague MT researchers across the world. + +Often, a command will die because it was parameterized incorrectly. For example, perhaps the +decoder ran out of memory. This allows you to adjust the parameter (e.g., `--joshua-mem`) and rerun +the script. Of course, if you change one of the parameters a step depends on, it will trigger a +rerun, which in turn might trigger further downstream reruns. + +## <a id="steps" /> Skipping steps, quitting early + +You will also find it useful to start the pipeline somewhere other than data preparation (for +example, if you have already-processed data and an alignment, and want to begin with building a +grammar) or to end it prematurely (if, say, you don't have a test set and just want to tune a +model). This can be accomplished with the `--first-step` and `--last-step` flags, which take as +argument a case-insensitive version of the following steps: + +- *FIRST*: Data preparation. Everything begins with data preparation. This is the default first + step, so there is no need to be explicit about it. + +- *ALIGN*: Alignment. You might want to start here if you want to skip data preprocessing. + +- *PARSE*: Parsing. This is only relevant for building SAMT grammars (`--type samt`), in which case + the target side (`--target`) of the training data (`--corpus`) is parsed before building a + grammar. + +- *THRAX*: Grammar extraction [with Thrax](thrax.html). If you jump to this step, you'll need to + provide an aligned corpus (`--alignment`) along with your parallel data. + +- *TUNE*: Tuning. The exact tuning method is determined with `--tuner {mert,mira,pro}`. With this + option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test + (`--test-grammar`) grammars. A full grammar (`--grammar`) will be filtered against the relevant + tuning or test set unless you specify `--no-filter-tm`. If you want a language model built from + the target side of your training data, you'll also need to pass in the training corpus + (`--corpus`). You can also specify an arbitrary number of additional language models with one or + more `--lmfile` flags. + +- *TEST*: Testing. If you have a tuned model file, you can test new corpora by passing in a test + corpus with references (`--test`). You'll need to provide a run name (`--name`) to store the + results of this run, which will be placed under `test/NAME`. You'll also need to provide a + Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a + grammar (`--grammar`); this will be filtered to the test data unless you specify + `--no-filter-tm`) or unless you directly provide a filtered test grammar (`--test-grammar`). + +- *LAST*: The last step. This is the default target of `--last-step`. + +We now discuss these steps in more detail. + +### <a id="prep" /> 1. 
DATA PREPARATION + +Data prepare involves doing the following to each of the training data (`--corpus`), tuning data +(`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path +prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and +TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET +files must have the same number of lines. + +For tuning and test data, multiple references are handled automatically. A single reference will +have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where +NUM starts at 0 and increments for as many references as there are. + +The following processing steps are applied to each file. + +1. **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test". + Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune` + and `--test` flags are not currently allowed. + +1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special + quotations). There are a few language-specific options that depend on the file extension + matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) + designation. + +1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). Again, there + are language-specific tokenizations for a few languages (English, German, and Greek). + +1. (Training only) **Removing** all parallel sentences with more than `--maxlen` tokens on either + side. By default, MAXLEN is 50. To turn this off, specify `--maxlen 0`. + +1. **Lowercasing**. + +This creates a series of intermediate files which are saved for posterity but compressed. For +example, you might see + + data/ + train/ + train.en.gz + train.tok.en.gz + train.tok.50.en.gz + train.tok.50.lc.en + corpus.en -> train.tok.50.lc.en + +The file "corpus.LANG" is a symbolic link to the last file in the chain. + +## 2. ALIGNMENT <a id="alignment" /> + +Alignments are between the parallel corpora at `$RUNDIR/data/train/corpus.{SOURCE,TARGET}`. To +prevent the alignment tables from getting too big, the parallel corpora are grouped into files of no +more than ALIGNER\_CHUNK\_SIZE blocks (controlled with a parameter below). The last block is folded +into the penultimate block if it is too small. These chunked files are all created in a +subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on. + +The pipeline parameters affecting alignment are: + +- `--aligner ALIGNER` {giza (default), berkeley, jacana} + + Which aligner to use. The default is [GIZA++](http://code.google.com/p/giza-pp/), but + [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead. When + using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it + with `--aligner-mem` (the default is 10g). + +- `--aligner-chunk-size SIZE` (1,000,000) + + The number of sentence pairs to compute alignments over. The training data is split into blocks + of this size, aligned separately, and then concatenated. + +- `--alignment FILE` + + If you have an already-computed alignment, you can pass that to the script using this flag. 
+ Note that, in this case, you will want to skip data preparation and alignment using + `--first-step thrax` (the first step after alignment) and also to specify `--no-prepare` so + as not to retokenize the data and mess with your alignments. + + The alignment file format is the standard format where 0-indexed many-many alignment pairs for a + sentence are provided on a line, source language first, e.g., + + 0-0 0-1 1-2 1-7 ... + + This value is required if you start at the grammar extraction step. + +When alignment is complete, the alignment file can be found at `$RUNDIR/alignments/training.align`. +It is parallel to the training corpora. There are many files in the `alignments/` subdirectory that +contain the output of intermediate steps. + +### <a id="parsing" /> 3. PARSING + +To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target side of the +training data must be parsed. The pipeline assumes your target side will be English, and will parse +it for you using [the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included. +If it is not the case that English is your target-side language, the target side of your training +data (found at CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that +it is parsed and will not reparse it. + +Parsing is affected by both the `--threads N` and `--jobs N` options. The former runs the parser in +multithreaded mode, while the latter distributes the runs across as cluster (and requires some +configuration, not yet documented). The options are mutually exclusive. + +Once the parsing is complete, there will be two parsed files: + +- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed. +- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used for + grammar extraction. + +## 4. THRAX (grammar extraction) <a id="tm" /> + +The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2) +the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the +alignment file. From these, it computes a synchronous context-free grammar. If you already have a +grammar and wish to skip this step, you can do so passing the grammar with the `--grammar +/path/to/grammar` flag. + +The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply ensure +that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it. If you *do +not* have a Hadoop installation, the pipeline will roll out out for you, running Hadoop in +standalone mode (this mode is triggered when `$HADOOP` is undefined). Theoretically, any grammar +extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient +enough; in practice, you probably are not patient enough, and will be limited to smaller +datasets. You may also run into problems with disk space; Hadoop uses a lot (use `--tmp +/path/to/tmp` to specify an alternate place for temporary data; we suggest you use a local disk +partition with tens or hundreds of gigabytes free, and not an NFS partition). Setting up your own +Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a +[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html). +In our experience, this works fine, but you should note the following caveats: + +- It is of crucial importance that you have enough physical disks. 
We have found that having too + few, or too slow of disks, results in a whole host of seemingly unrelated issues that are hard to + resolve, such as timeouts. +- NFS filesystems can cause lots of problems. You should really try to install physical disks that + are dedicated to Hadoop scratch space. + +Here are some flags relevant to Hadoop and grammar extraction with Thrax: + +- `--hadoop /path/to/hadoop` + + This sets the location of Hadoop (overriding the environment variable `$HADOOP`) + +- `--hadoop-mem MEM` (2g) + + This alters the amount of memory available to Hadoop mappers (passed via the + `mapred.child.java.opts` options). + +- `--thrax-conf FILE` + + Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax + templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one + of "hiero" or "samt". + +When the grammar is extracted, it is compressed and placed at `$RUNDIR/grammar.gz`. + +## <a id="lm" /> 5. Language model + +Before tuning can take place, a language model is needed. A language model is always built from the +target side of the training corpus unless `--no-corpus-lm` is specified. In addition, you can +provide other language models (any number of them) with the `--lmfile FILE` argument. Other +arguments are as follows. + +- `--lm` {kenlm (default), berkeleylm} + + This determines the language model code that will be used when decoding. These implementations + are described in their respective papers (PDFs: + [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf), + [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). KenLM is written in + C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization. + +- `--lmfile FILE` + + Specifies a pre-built language model to use when decoding. This language model can be in ARPA + format, or in KenLM format when using KenLM or BerkeleyLM format when using that format. + +- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, `--witten-bell` + + At the tuning step, an LM is built from the target side of the training data (unless + `--no-corpus-lm` is specified). This controls which code is used to build it. The default is a + KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), and is strongly recommended. + + If SRILM is used, it is called with the following arguments: + + $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz + + Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline. + + [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java) + is also available. It computes a Kneser-Ney LM with a constant discounting (0.75) and no count + thresholding. The flag `--buildlm-mem` can be used to control how much memory is allocated to the + Java process. The default is "2g", but you will want to increase it for larger language models. + + A language model built from the target side of the training data is placed at `$RUNDIR/lm.gz`. + +## Interlude: decoder arguments + +Running the decoder is done in both the tuning stage and the testing stage. A critical point is +that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in +particular when decoding with large grammars and large language models. 
The default amount of +memory is 3100m, which is likely not enough (especially if you are decoding with SAMT grammar). You +can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is a Java +memory specification (passed to its `-Xmx` flag). + +## <a id="tuning" /> 6. TUNING + +Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`). If Moses is +installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner mira`, recommended). +Tuning is run till convergence in the `$RUNDIR/tune/N` directory, where N is the tuning instance. +By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary +number of times due to [recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the +variance of tuning procedures in machine translation, in particular MERT. This can be activated +with `--optimizer-runs N`. Each run can be found in a directory `$RUNDIR/tune/N`. + +When tuning is finished, each final configuration file can be found at either + + $RUNDIR/tune/N/joshua.config.final + +where N varies from 1..`--optimizer-runs`. + +## <a id="testing" /> 7. Testing + +For each of the tuner runs, Joshua takes the tuner output file and decodes the test set. If you +like, you can also apply minimum Bayes-risk decoding to the decoder output with `--mbr`. This +usually yields about 0.3 - 0.5 BLEU points, but is time-consuming. + +After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score, +writes it to `$RUNDIR/test/final-bleu`, and cats it. It also writes a file +`$RUNDIR/test/final-times` containing a summary of runtime information. That's the end of the pipeline! + +Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a +number of arguments: + +- `--first-step TEST` + + This tells the decoder to start at the test step. + +- `--name NAME` + + A name is needed to distinguish this test set from the previous ones. Output for this test run + will be stored at `$RUNDIR/test/NAME`. + +- `--joshua-config CONFIG` + + A tuned parameter file is required. This file will be the output of some prior tuning run. + Necessary pathnames and so on will be adjusted. + +## <a id="analysis"> 8. ANALYSIS + +If you have used the suggested layout, with a number of related runs all contained in a common +directory with sequential numbers, you can use the script `$JOSHUA/scripts/training/summarize.pl` to +display a summary of the mean BLEU scores from all runs, along with the text you placed in the run +README file (using the pipeline's `--readme TEXT` flag). + +## COMMON USE CASES AND PITFALLS + +- If the pipeline dies at the "thrax-run" stage with an error like the following: + + JOB FAILED (return code 1) + hadoop/bin/hadoop: line 47: + /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory + Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell + Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell + + This occurs if the `$HADOOP` environment variable is set but does not point to a working + Hadoop installation. To fix it, make sure to unset the variable: + + # in bash + unset HADOOP + + and then rerun the pipeline with the same invocation. + +- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In + particular, SAMT grammars often require a large amount of memory. 
Many steps have been taken to + reduce memory usage, including beam settings and test-set- and sentence-level filtering of + grammars. However, memory usage can still be in the tens of gigabytes. + + To accommodate this kind of variation, the pipeline script allows you to specify both (a) the + amount of memory used by the Joshua decoder instance and (b) the amount of memory required of + nodes obtained by the qsub command. These are accomplished with the `--joshua-mem` MEM and + `--qsub-args` ARGS commands. For example, + + pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ... + + Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB + from the Hadoop server. If more memory is needed, set the memory requirement with the + `--hadoop-mem` in the same way as the `--joshua-mem` option is used. + +- Other pitfalls and advice will be added as it is discovered. + +## FEEDBACK + +Please email [email protected] with problems or suggestions. + http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/server.md ---------------------------------------------------------------------- diff --git a/5.0/server.md b/5.0/server.md new file mode 100644 index 0000000..52b2a66 --- /dev/null +++ b/5.0/server.md @@ -0,0 +1,30 @@ +--- +layout: default +category: links +title: Server mode +--- + +The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs. + +Threading takes place both within and across requests. Threads from the decoder pool are assigned in round-robin manner across requests, preventing starvation. + + +# Invoking the server + +A running server is configured at invokation time. To start in server mode, run `joshua-decoder` with the option `-server-port [PORT]`. Additionally, the server can be configured in the same ways as when using the command-line-functionality. + +E.g., + + $JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10 + +## Using the server + +To test that the server is working, a set of inputs can be sent to the server from the command line. + +The server, as configured in the example above, will then respond to requests on port 10101. You can test it out with the `nc` utility: + + wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101 + +Since no model was loaded, this will just return the text to you as sent to the server. + +The `-server-port` option can also be used when creating a [bundled configuration](bundle.html) that will be run in server mode. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/thrax.md ---------------------------------------------------------------------- diff --git a/5.0/thrax.md b/5.0/thrax.md new file mode 100644 index 0000000..a904b23 --- /dev/null +++ b/5.0/thrax.md @@ -0,0 +1,14 @@ +--- +layout: default +category: advanced +title: Grammar extraction with Thrax +--- + +One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar +filtering, and details on the configuration file options. It will also include details about our +experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought +sweat and tears. 
+ +In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if there is something you +need to do that you don't understand. You might also be able to dig up some information [on the old +Thrax page](http://cs.jhu.edu/~jonny/thrax/). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/tms.md ---------------------------------------------------------------------- diff --git a/5.0/tms.md b/5.0/tms.md new file mode 100644 index 0000000..68f8732 --- /dev/null +++ b/5.0/tms.md @@ -0,0 +1,106 @@ +--- +layout: default +category: advanced +title: Building Translation Models +--- + +# Build a translation model + +Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences. + +We will copy (or symlink) the parallel source text files in a subdirectory called `input/`. + +Then, we concatenate all the training files on each side. The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step alignment`. + +* to tokenize the English data, do + + cat callhome.en europarl.en fisher.en > all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en + +The same can be done for the Spanish side of the input data: + + cat callhome.es europarl.es fisher.es > all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es + +By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project. + +The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line. + + paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ + | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en + +contents of `splittabls.pl` by Matt Post: + + #!/usr/bin/perl + + # splits on tab, printing respective chunks to the list of files given + # as script arguments + + use FileHandle; + + my @fh; + $| = 1; # don't buffer output + + if (@ARGV < 0) { + print "Usage: splittabs.pl < tabbed-file\n"; + exit; + } + + my @fh = map { get_filehandle($_) } @ARGV; + @ARGV = (); + + while (my $line = <>) { + chomp($line); + my (@fields) = split(/\t/,$line,scalar @fh); + + map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields); + } + + sub get_filehandle { + my $file = shift; + + if ($file eq "-") { + return *STDOUT; + } else { + local *FH; + open FH, ">$file" or die "can't open '$file' for writing"; + return *FH; + } + } + +Now we can run the pipeline to extract the grammar. Run the following script: + + #!/bin/bash + + # this creates a grammar + + # NEED: + # pair + # type + + set -u + + pair=es-en + type=hiero + + #. ~/.bashrc + + #basedir=$(pwd) + + dir=grammar-$pair-$type + + [[ ! 
-d $dir ]] && mkdir -p $dir + cd $dir + + source=$(echo $pair | cut -d- -f 1) + target=$(echo $pair | cut -d- -f 2) + + $JOSHUA/scripts/training/pipeline.pl \ + --source $source \ + --target $target \ + --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \ + --type $type \ + --joshua-mem 100g \ + --no-prepare \ + --first-step align \ + --last-step thrax \ + --hadoop $HADOOP \ + --threads 8 \ http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/tutorial.md ---------------------------------------------------------------------- diff --git a/5.0/tutorial.md b/5.0/tutorial.md new file mode 100644 index 0000000..038db9f --- /dev/null +++ b/5.0/tutorial.md @@ -0,0 +1,174 @@ +--- +layout: default +category: links +title: Pipeline tutorial +--- + +This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a +sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of +other options available in the pipeline. + +## Download and Setup + +Download and install Joshua as described on the [quick start page](index.html), installing it under +`~/code/`. Once you've done that, you should make sure you have the following environment variable set: + + export JOSHUA=$HOME/code/joshua-v5.0 + export JAVA_HOME=/usr/java/default + +If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua +will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also +install Moses, and define the environment variable `$MOSES` to point to the root of its installation. + +## A basic pipeline run + +For today's experiments, we'll be building a Bengali--English system using data included in the +[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking +the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's +[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is +not typically found in machine translation tutorials. + +Download the data and install it somewhere: + + cd ~/data + wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip + unzip indian-parallel-corpora.zip + +Then define the environment variable `$INDIAN` to point to it: + + cd ~/data/indian-parallel-corpora-master + export INDIAN=$(pwd) + +### Preparing the data + +Inside this tarball is a directory for each language pair. Within each language directory is another +directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was +done because the normalization scripts provided with Joshua is written in scripting languages that +often have problems properly handling UTF-8 character sets. We will be using these tokenized +versions, and preventing the pipeline from retokenizing using the `--no-prepare` flag. + +In `$INDIAN/bn-en/tok`, you should see the following files: + + $ ls $INDIAN/bn-en/tok + dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2 + dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3 + dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn + dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en + dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1 + +We will now use this data to test the complete pipeline with a single command. 
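Before launching the pipeline, it is worth confirming that each source-language file is
line-parallel with its English counterpart, since the pipeline expects the same number of lines on
both sides. A quick check, assuming the file layout listed above:

    wc -l $INDIAN/bn-en/tok/training.bn-en.bn $INDIAN/bn-en/tok/training.bn-en.en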
+ +### Run the pipeline + +Create an experiments directory for containing your first experiment: + + mkdir ~/expts/joshua + cd ~/expts/joshua + +We will now create the baseline run, using a particular directory structure for experiments that +will allow us to take advantage of scripts provided with Joshua for displaying the results of many +related experiments. + + cd ~/expts/joshua + $JOSHUA/bin/pipeline.pl \ + --rundir 1 \ + --readme "Baseline Hiero run" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --corpus $INDIAN/bn-en/tok/dict.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --lm-order 3 + +This will start the pipeline building a Bengali--English translation system constructed from the +training data and a dictionary, tuned against dev, and tested against devtest. It will use the +default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment, +KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state +minimization for representing LM state in the decoder, and so on. We change the order of the n-gram +model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM. + +A few notes: + +- This will likely take many hours to run, especially if you don't have a Hadoop cluster. + +- If you are running on Mac OS X, KenLM's `lmplz` will not build due to the absence of static + libraries. In that case, you should add the flag `--lm-gen srilm` (recommended, if SRILM is + installed) or `--lm-gen berkeleylm`. + +### Variations + +Once that is finished, you will have a baseline model. From there, you might wish to try variations +of the baseline model. Here are some examples of what you could vary: + +- Build an SAMT model (`--type samt`), GKHM model (`--type ghkm`), or phrasal ITG model (`--type phrasal`) + +- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`) + +- Build the language model with BerkeleyLM (`--lm-gen srilm`) instead of KenLM (the default) + +- Change the order of the LM from the default of 5 (`--lm-order 4`) + +- Tune with MIRA instead of MERT (`--tuner mira`). This requires that Moses is installed. + +- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100) + +- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`) + +To do this, we will create new runs that partially reuse the results of previous runs. This is +possible by doing two things: (1) incrementing the run directory and providing an updated README +note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) +providing the needed dependencies. + +# A second run + +Let's begin by changing the tuner, to see what effect that has. 
To do so, we change the run +directory, tell the pipeline to start at the tuning step, and provide the needed dependencies: + + $JOSHUA/bin/pipeline.pl \ + --rundir 2 \ + --readme "Tuning with MIRA" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --first-step tune \ + --tuner mira \ + --grammar 1/grammar.gz \ + --no-corpus-lm \ + --lmfile 1/lm.gz + + Here, we have essentially the same invocation, but we have told the pipeline to use a different + MIRA, to start with tuning, and have provided it with the language model file and grammar it needs + to execute the tuning step. + + Note that we have also told it not to build a language model. This is necessary because the + pipeline always builds an LM on the target side of the training data, if provided, but we are + supplying the language model that was already built. We could equivalently have removed the + `--corpus` line. + +## Changing the model type + +Let's compare the Hiero model we've already built to an SAMT model. We have to reextract the +grammar, but can reuse the alignments and the language model: + + $JOSHUA/bin/pipeline.pl \ + --rundir 3 \ + --readme "Baseline SAMT model" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --alignment 1/alignments/training.align \ + --first-step parse \ + --no-corpus-lm \ + --lmfile 1/lm.gz + +See [the pipeline script page](pipeline.html#steps) for a list of all the steps. + +## Analyzing the results + +We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them +using the `$JOSHUA/scripts/training/summarize.pl` script. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/5.0/zmert.md ---------------------------------------------------------------------- diff --git a/5.0/zmert.md b/5.0/zmert.md new file mode 100644 index 0000000..d6a5d3c --- /dev/null +++ b/5.0/zmert.md @@ -0,0 +1,83 @@ +--- +layout: default +category: advanced +title: Z-MERT +--- + +This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate +training module, written by Omar F. Zaidan. It is easily adapted to drop in different decoders, and +was also written so as to work with different objective functions (other than BLEU). + +((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section)) + +Z-MERT, can be used by launching the driver program (`ZMERT.java`), which expects a config file as +its main argument. This config file can be used to specify any subset of Z-MERT's 20-some +parameters. For a full list of those parameters, and their default values, run ZMERT with a single +-h argument as follows: + + java -cp $JOSHUA/bin joshua.zmert.ZMERT -h + +So what does a Z-MERT config file look like? + +Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`. 
You will find that it
specifies the following "main" MERT parameters:

    (*) -dir dirPrefix: working directory
    (*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset
    (*) -r refFile: target sentences (reference translations) of the MERT dataset
    (*) -rps refsPerSen: number of reference translations per sentence
    (*) -p paramsFile: file containing parameter names, initial values, and ranges
    (*) -maxIt maxMERTIts: maximum number of MERT iterations
    (*) -ipi initsPerIt: number of intermediate initial points per iteration
    (*) -cmd commandFile: name of file containing commands to run the decoder
    (*) -decOut decoderOutFile: name of the output file produced by the decoder
    (*) -dcfg decConfigFile: name of decoder config file
    (*) -N N: size of N-best list (per sentence) generated in each MERT iteration
    (*) -v verbosity: output verbosity level (0-2; higher value => more verbose)
    (*) -seed seed: seed used to initialize the random number generator

(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an
internal decoder. If Joshua is run as an external decoder, as is the case in
this README, then this parameter is ignored.)

To test Z-MERT on the 100-sentence test set of example2, provide this config
file to Z-MERT as follows:

    java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out

This will run Z-MERT for a couple of iterations on the data from the example2
folder. (Notice that we have made copies of the source and reference files
from example2 and renamed them as src.txt and ref.* in the MERT_example folder,
just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run
is complete, you should be able to inspect the log file to see what kinds of
things it did. If everything goes well, the run should take a few minutes, of
which more than 95% is time spent by Z-MERT waiting on Joshua to finish
decoding the sentences (once per iteration).

The output file you get should be equivalent to `ZMERT.out.verbosity1`. If you
rerun the experiment with the verbosity (`-v`) argument set to 2 instead of 1,
the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has
more interesting details about what Z-MERT does.

Notice the additional `-maxMem` argument. It tells Z-MERT that it should not
keep holding on to memory while the decoder is running (during which time Z-MERT
would be idle). The 500 tells Z-MERT that it can use a maximum of 500 MB.
For more details on this issue, see section (4) in Z-MERT's README.

A quick note about Z-MERT's interaction with the decoder: if you examine the
file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`)
argument in Z-MERT's config file, you'll find it contains the command one would
use to run the decoder. Z-MERT launches the commandFile as an external
process, and assumes that it will launch the decoder to produce translations.
After launching this external process, Z-MERT waits for it to finish, then uses
the resulting output file for parameter tuning (in addition to the output files
from previous iterations). The command file here has only a single command, but
your command file could have multiple lines. Just make sure the command file
itself is executable.
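
To make the relationship between these arguments concrete, here is a minimal
sketch of what such a command file might contain. This is not the actual
`decoder_command_ex2.txt` shipped with Joshua; the file names, paths, and
decoder flags below are purely illustrative:

    #!/bin/bash
    # Hypothetical command file, launched by Z-MERT once per iteration.
    # The decoder config named here should be the file given to Z-MERT as
    # -dcfg, the redirected output file should match -decOut, and the
    # decoder's top_n setting should equal Z-MERT's -N.
    $JOSHUA/bin/decoder -c joshua.config < src.txt > nbest.out 2> decoder.log
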

Notice that the Z-MERT arguments `decConfigFile` and `decoderOutFile` (`-dcfg` and
`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single
command. Also, the Z-MERT argument `-N` must match the value for `top_n` in
Joshua's config file, indicated by the Z-MERT argument decConfigFile (`-dcfg`).

For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/advanced.md ---------------------------------------------------------------------- diff --git a/6.0/advanced.md b/6.0/advanced.md new file mode 100644 index 0000000..4997e73 --- /dev/null +++ b/6.0/advanced.md @@ -0,0 +1,7 @@ +--- +layout: default6 +category: links +title: Advanced features +--- + +

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/bundle.md ---------------------------------------------------------------------- diff --git a/6.0/bundle.md b/6.0/bundle.md new file mode 100644 index 0000000..f433172 --- /dev/null +++ b/6.0/bundle.md @@ -0,0 +1,100 @@ +--- +layout: default6 +category: links +title: Building a language pack +--- +

*The information in this page applies to Joshua 6.0.3 and greater*.

Joshua distributes [language packs](/language-packs), which are models
that have been trained and tuned for particular language pairs. After
you have trained and tuned a model, you can easily create your own
language pack with the provided `$JOSHUA/scripts/support/run-bundler.py`
script, which gathers files from a pipeline training directory and
bundles them together for easy distribution and release.

The script takes just two mandatory arguments, in the following order:

1. The path to the Joshua configuration file to base the bundle
   on. This file should contain the tuned weights from the tuning run, so
   you can use either the final tuned file from the tuning run
   (`tune/joshua.config.final`) or the config file from the test run
   (`test/model/joshua.config`).
1. The directory to place the language pack in. If this directory
   already exists, the script will die, unless you also pass `--force`.

In addition, there are a number of other arguments that may be important.

- `--root /path/to/root`. If file paths in the Joshua config file are
  not absolute, you need to provide the relative root. If you specify a
  tuned pipeline file (such as `tune/joshua.config.final` above), the
  paths should all be absolute. If you instead provide a config file
  from a previous run bundle (e.g., `test/model/joshua.config`), the
  bundle directory above is the relative root.

- The config file options that are used in the pipeline are likely not
  the ones you want if you release a model. For example, the tuning
  configuration file contains options that tell Joshua to output 300
  translation candidates for each sentence (`-top-n 300`) and to
  include lots of detail about each translation (`-output-format '%i
  ||| %s ||| %f ||| %c'`). Because of this, you will want to tell the
  run bundler to change many of the config file options to be more
  geared towards human-readable output. The default copy-config
  options are `-top-n 0 -output-format %S -mark-oovs false`, which
  accomplish exactly this (human readability).

- A very important issue has to do with the translation model (the
  "TM", also sometimes called the grammar or phrase table). The
  translation model can be very large, so that it takes a long time to
  load and to [pack](packing.html).
  To reduce this time during model
  training, the translation model is filtered against the tuning and
  testing data in the pipeline, and these filtered models will be what
  is listed in the source config files. However, when exporting a
  model for use as a language pack, you need to export the full model
  instead of the filtered one so as to maximize your coverage on new
  test data. The `--tm` parameter is used to accomplish this; it takes
  an argument specifying the path to the full model. If you would
  additionally like the large model to be [packed](packing.html) (this
  is recommended; it reformats the TM so that it can be quickly loaded
  at run time), you can use `--pack-tm` instead. You can only pack one
  TM (but typically there is only one TM anyway). Multiple `--tm`
  parameters can be passed; they will replace TMs found in the config
  file in the order they are found.

Here is an example invocation for packing a hierarchical model using
the final tuned Joshua config file:

    ./run-bundler.py \
      --force --verbose \
      /path/to/rundir/tune/joshua.config.final \
      language-pack-YYYY-MM-DD \
      --root /path/to/rundir \
      --pack-tm /path/to/rundir/grammar.gz \
      --copy-config-options \
        '-top-n 0 -output-format %S -mark-oovs false' \
      --server-port 5674

The copy config options tell the decoder to present just the
single-best (`-top-n 0`) translated output string that has been
heuristically capitalized (`-output-format %S`), to not append `_OOV`
to OOVs (`-mark-oovs false`), and to use the translation model
`/path/to/rundir/grammar.gz` as the main translation model, packing it
before placing it in the bundle. Note that these arguments to
`--copy-config-options` are the default, so you could leave this option
off entirely. See [this page](decoder.html) for a longer list of
decoder options.

The following command is a slight variation used for phrase-based
models; it instead takes the test-set Joshua config (the result is the
same):

    ./run-bundler.py \
      --force --verbose \
      /path/to/rundir/test/model/joshua.config \
      --root /path/to/rundir/test/model \
      language-pack-YYYY-MM-DD \
      --pack-tm /path/to/rundir/model/phrase-table.gz \
      --server-port 5674

In both cases, a new directory `language-pack-YYYY-MM-DD` will be
created along with a README and a number of support files.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/decoder.md ---------------------------------------------------------------------- diff --git a/6.0/decoder.md b/6.0/decoder.md new file mode 100644 index 0000000..e8dc8c9 --- /dev/null +++ b/6.0/decoder.md @@ -0,0 +1,385 @@ +--- +layout: default6 +category: links +title: Decoder configuration parameters +--- +

Joshua configuration parameters affect the runtime behavior of the decoder itself. This page
lists the complete set of these parameters and describes how to invoke the decoder manually.

To run the decoder, a convenience script is provided that loads the necessary Java libraries.
Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation,
its syntax is:

    $JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...]

The `-m` argument, if present, must come first, and the memory specification is in Java format
(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and
"gigabytes", and there cannot be a space between the number and the unit. The value of this
argument is passed to Java itself in the invocation of the decoder, and the remaining options are
passed to Joshua. The `-c` parameter is particularly important because it specifies the location
of the configuration file.
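
Putting these pieces together, a typical manual invocation might look like the
following (the memory amount, paths, and file names here are illustrative):

    # Decode a file of tokenized sentences with a 4 GB heap, reading the model
    # from the given configuration file and sending the decoder's log to a file.
    cat input.txt | $JOSHUA/bin/decoder -m 4g -c /path/to/joshua.config > output.txt 2> decoder.log
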

The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are
received, according to a number of [output options](#output). If no run-time parameters are
specified (e.g., no translation model), sentences are simply pushed through untranslated. Blank
lines are similarly pushed through as blank lines, so as to maintain parallelism with the input.

Parameters can be provided to Joshua via a configuration file and from the command
line. Command-line arguments override values found in the configuration file. The format for
configuration file parameters is

    parameter = value

Command-line options are specified in the following format

    -parameter value

Values are one of four types (which we list here mostly to call attention to the boolean format):

- STRING, an arbitrary string (no spaces)
- FLOAT, a floating-point value
- INT, an integer
- BOOLEAN, a boolean value. For booleans, `true` evaluates to true, and all other values evaluate
  to false. For command-line options, the value may be omitted, in which case it evaluates to
  true. For example, the following are equivalent:

      $JOSHUA/bin/decoder -mark-oovs true
      $JOSHUA/bin/decoder -mark-oovs

## Joshua configuration file

In addition to the decoder parameters described below, the configuration file contains the model
feature weights. These weights are distinguished from runtime parameters in that they are delimited
by a space instead of an equals sign. They take the following
format, and by convention are placed at the end of the configuration file:

    lm_0 4.23
    tm_pt_0 -0.2
    OOVPenalty -100

Joshua can make use of thousands of features, which are described in further detail in the
[feature file](features.html).

## Joshua decoder parameters

This section contains a list of the Joshua run-time parameters. An important note about the
parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are
removed and case is converted to lowercase. For example, the following parameter forms are
equivalent (either in the configuration file or from the command line):

    {top-n, topN, top_n, TOP_N, t-o-p-N}
    {poplimit, pop-limit, pop_limit, popLimit, PoPlImIt}

This basically defines equivalence classes of parameters, and relieves you of the task of having to
remember the exact format of each parameter.

In what follows, we group the configuration parameters into the following groups:

- [General options](#general)
- [Pruning](#pruning)
- [Translation model options](#tm)
- [Language model options](#lm)
- [Output options](#output)
- [Alternate modes of operation](#modes)

<a id="general" />

### General decoder options

- `c`, `config` --- *NULL*

  Specifies the configuration file from which Joshua options are loaded. This feature is unique in
  that it must be specified from the command line (obviously).

- `amortize` --- *true*

  When true, specifies that sorting of the rule lists at each trie node in the grammar should be
  delayed until the trie node is accessed. When false, all such nodes are sorted before decoding
  even begins.
  Setting it to true results in slower per-sentence decoding, but allows the decoder to
  begin translating almost immediately (especially with large grammars).

- `server-port` --- *0*

  If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified
  port. Clients can connect to it directly through programming APIs or command-line tools like
  `telnet` or `nc`.

      $ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723
      ...
      $ cat input.txt | nc localhost 8723 > results.txt

- `maxlen` --- *200*

  Input sentences longer than this are truncated.

- `feature-function`

  Enables a particular feature function. See the [feature function page](features.html) for more
  information.

- `oracle-file` --- *NULL*

  The location of a set of oracle reference translations, parallel to the input. When present,
  after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the
  translation forest with a BLEU approximation in order to extract the oracle translation from the
  forest. This is useful for obtaining an (approximation to an) upper bound on your translation
  model under particular search settings.

- `default-nonterminal` --- *"X"*

  This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this
  label to every word of the input, in fact, so that even known words can be translated as OOVs, if
  the model prefers them. Usually, a very low weight on the `OOVPenalty` feature discourages their
  use unless necessary.

- `goal-symbol` --- *"GOAL"*

  This is the symbol whose presence in the chart over the whole input span denotes a successful
  parse (translation). It should match the LHS nonterminal in your glue grammar. Internally,
  Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can
  optionally supply in the configuration file.

- `true-oovs-only` --- *false*

  By default, Joshua creates an OOV entry for every word in the source sentence, regardless of
  whether it is found in the grammar. This allows every word to be pushed through untranslated
  (although potentially incurring a high cost based on the `OOVPenalty` feature). If this option is
  set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua
  examines the first level of the grammar trie for each word of the input (this isn't a perfect
  heuristic, since a word could be present only in deeper levels of the trie).

- `threads`, `num-parallel-decoders` --- *1*

  This determines how many simultaneous decoding threads to launch.

  Outputs are assembled in order, and Joshua has to hold on to the complete target hypergraph until
  it is ready to be processed for output, so too many simultaneous threads could result in lots of
  memory usage if a long sentence causes many subsequent sentences to be queued up. We have run
  Joshua with as many as 64 threads without any problems of this kind, but it's useful to keep this
  in the back of your mind.

- `weights-file` --- *NULL*

  Weights are appended to the end of the Joshua configuration file, by convention. If you prefer to
  put them in a separate file, you can do so, and point to the file with this parameter.

### Pruning options <a id="pruning" />

- `pop-limit` --- *100*

  The number of cube-pruning hypotheses that are popped from the candidates list for each span of
  the input.
  Higher values result in a larger portion of the search space being explored at the
  cost of an increased search time. For exhaustive search, set `pop-limit` to 0.

- `filter-grammar` --- *false*

  Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is
  filtered at runtime down to rules that can be applied to the sentence under consideration. This
  takes some time (which we haven't thoroughly quantified), but can result in the removal of many
  rules that are only partially applicable to the sentence.

- `constrain-parse` --- *false*
- `use_pos_labels` --- *false*

  *These features are not documented.*

### Translation model options <a id="tm" />

Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar
containing translation rules, and the glue grammar for patching things together. Internally, Joshua
doesn't distinguish between the roles of these grammars; they are treated differently only in that
they typically have different span limits (the maximum input width they can be applied to).

Grammars are instantiated with config file lines of the following form:

    tm = TYPE OWNER SPAN_LIMIT FILE

* `TYPE` is the grammar type, which must be set to "thrax".
* `OWNER` is the grammar's owner, which defines the set of [feature weights](features.html) that
  apply to the weights found in each line of the grammar (using different owners allows each grammar
  to have different sets and numbers of weights, while sharing owners allows weights to be shared
  across grammars).
* `SPAN_LIMIT` is the maximum span of the input that rules from this grammar can be applied to. A
  span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must
  be anchored to the left side of the sentence (index 0).
* `FILE` is the path to the file containing the grammar. If the file is a directory, it is assumed
  to be [packed](packing.html). Only one packed grammar can currently be used at a time.

For reference, the following two translation model lines are used by the [pipeline](pipeline.html):

    tm = thrax pt 20 /path/to/packed/grammar
    tm = thrax glue -1 /path/to/glue/grammar

### Language model options <a id="lm" />

Joshua supports any number of language models. With Joshua 6.0, these
are just regular feature functions:

    feature-function = LanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE
    feature-function = StateMinimizingLanguageModel -lm_file /path/to/lm/file -lm_order N -lm_type TYPE

`LanguageModel` is a generic language model, supporting types 'kenlm'
(the default) and 'berkeleylm'. `StateMinimizingLanguageModel`
implements LM state minimization to reduce the size of context n-grams
where appropriate
([Li and Khudanpur, 2008](http://www.aclweb.org/anthology/W08-0402.pdf);
[Heafield et al., 2013](https://aclweb.org/anthology/N/N13/N13-1116.pdf)). This
is currently only supported by KenLM, so the `-lm_type` option is not
available here.

The other key/value pairs are defined as follows:

* `lm_type`: one of "kenlm" or "berkeleylm"
* `lm_order`: the order of the language model
* `lm_file`: the path to the language model file. All language model
  types support the standard ARPA format. Additionally, if the LM
  type is "kenlm", this file can be compiled into KenLM's binary
  format (using the program at `$JOSHUA/bin/build_binary`); if the
  LM type is "berkeleylm", it can be compiled by following the
  directions in `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The
  [pipeline](pipeline.html) will automatically compile either type.

For each language model, you need to specify a feature weight in the following format:

    lm_0 WEIGHT
    lm_1 WEIGHT
    ...

where the indices correspond to the order of the language model declaration lines.
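
As an illustration, a configuration file that loads two hypothetical language
models and weights them might contain the following lines (the file paths,
orders, and weight values are made up for this example):

    feature-function = LanguageModel -lm_file /path/to/gigaword.kenlm -lm_order 5 -lm_type kenlm
    feature-function = StateMinimizingLanguageModel -lm_file /path/to/europarl.kenlm -lm_order 5

    lm_0 1.0
    lm_1 0.5

Here `lm_0` refers to the first declaration (the `LanguageModel`) and `lm_1` to
the second (the `StateMinimizingLanguageModel`), since weights are matched to
language models in the order they are declared.
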

### Output options <a id="output" />

- `output-format` *New in 5.0*

  Joshua prints a lot of information to STDERR (making this more granular is on the TODO
  list). Output to STDOUT is reserved for decoder translations, and is controlled by the
  `output-format` parameter, which can contain any of the following format specifiers:

  - `%i`: the sentence number (0-indexed)

  - `%e`: the source sentence

  - `%s`: the translated sentence

  - `%S`: the translated sentence, with some basic capitalization and denormalization, e.g.,

        $ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null
        ¿Who you lookin' at, Mr.?

  - `%t`: the target-side tree projection, all printed on one line (PTB style)

  - `%d`: the synchronous derivation, with each rule printed indented on its own line

  - `%f`: the list of feature values (as name=value pairs)

  - `%c`: the model cost

  - `%w`: the weight vector (unimplemented)

  - `%a`: the alignments between source and target words (currently broken for hierarchical mode)

  The default value is

      output-format = %i ||| %s ||| %f ||| %c

  i.e.,

      input ID ||| translation ||| model scores ||| score

- `top-n` --- *300*

  The number of translation hypotheses to output, sorted in decreasing order of model score.

- `use-unique-nbest` --- *true*

  When constructing the n-best list for a sentence, skip hypotheses whose string has already been
  output.

- `escape-trees` --- *false*

- `include-align-index` --- *false*

  Output the source word indices that each target word aligns to.

- `mark-oovs` --- *false*

  If `true`, this causes the text "_OOV" to be appended to each untranslated word in the output.

- `visualize-hypergraph` --- *false*

  If set to true, a visualization of the hypergraph will be displayed, though you will have to
  explicitly include the relevant jar files. See the example usage in
  `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence,
  translation, and synchronous derivation.

- `dump-hypergraph` --- ""

  This feature directs that the hypergraph should be written to disk for each input sentence. If
  set, the value should contain the string "%d", which is replaced with the sentence number. For
  example,

      cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt

  Note that the output directory must exist.

  TODO: revive the
  [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format)
  on the ACL Wiki and support that format.

### Lattice decoding

In addition to regular sentences, Joshua can decode weighted lattices encoded in
[the PLF format](http://www.statmt.org/moses/?n=Moses.WordLattices), except that path costs should
be listed as <b>log probabilities</b> instead of probabilities. Lattice decoding was originally
added by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
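
For instance, a two-word lattice whose first word is ambiguous between two
alternatives could be written on a single input line roughly as follows (the
words and scores are illustrative; see the Moses page linked above for the
authoritative description of the PLF syntax):

    ((('hello', -0.693, 1), ('howdy', -0.693, 1),), (('world', 0.0, 1),),)

Each triple lists a word, its score (here a log probability), and the offset of
the lattice node that the edge points to.
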

Joshua will automatically detect whether the input sentence is a regular sentence (the usual case)
or a lattice. If a lattice, a feature will be activated that accumulates the cost of different
paths through the lattice. In this case, you need to ensure that a weight for this feature is
present in [your model file](decoder.html). The [pipeline](pipeline.html) will handle this
automatically, or if you are doing this manually, you can add the line

    SourcePath COST

to your Joshua configuration file.

Lattices must be listed one per line.

### Alternate modes of operation <a id="modes" />

In addition to decoding input sentences in the standard way, Joshua supports both *constrained
decoding* and *synchronous parsing*. In both settings, the source and target sides are provided
as input, and the decoder finds a derivation between them.

#### Constrained decoding

To enable constrained decoding, simply append the desired target string as part of the input, in
the following format:

    source sentence ||| target sentence

Joshua will translate the source sentence constrained to the target sentence. There are a few
caveats:

  * Left-state minimization cannot be enabled for the language model

  * A heuristic is used to constrain the derivation (the LM state must match against the
    input). This is not a perfect heuristic, and sometimes results in analyses that are not
    perfectly constrained to the input, but have extra words.

#### Synchronous parsing

Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in
Dyer (NAACL 2010) ([PDF](http://www.aclweb.org/anthology/N10-1033.pdf)). To enable this:

  - Set the configuration parameter `parse = true`.

  - Remove all language models from the configuration file.

  - Provide input in the following format:

        source sentence ||| target sentence

You may also wish to display the synchronous parse tree (`-output-format %t`) and the alignment
(`-include-align-index`).
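
Putting this together, a hypothetical synchronous parsing invocation might look
like the following, using the command-line form of the `parse` parameter
described above (the sentence pair and config path are illustrative):

    # The config file referenced here is assumed to have had all of its
    # language models removed, as described above.
    echo "el gato negro ||| the black cat" | \
      $JOSHUA/bin/decoder -c /path/to/joshua.config -parse true -output-format %t
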