[15/18] incubator-joshua-site git commit: Initial import of joshua-decoder.github.com site to Apache

lewismc Mon, 04 Apr 2016 22:13:25 -0700

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/faq.md
----------------------------------------------------------------------
diff --git a/6.0/faq.md b/6.0/faq.md
new file mode 100644
index 0000000..cc06b11
--- /dev/null
+++ b/6.0/faq.md
@@ -0,0 +1,161 @@
+---
+layout: default6
+category: help
+title: Frequently Asked Questions
+---
+
+Solutions to common problems will be posted here as we become aware of
+them.  If you need help with something, please check
+[our support group](https://groups.google.com/forum/#!forum/joshua_support)
+for a solution, or
+[post a new 
question](https://groups.google.com/forum/#!newtopic/joshua_support).
+
+### I get a message stating: "no ken in java.library.path"
+
+This occurs when [KenLM](https://kheafield.com/code/kenlm/) failed to
+build. This can occur for a number of reasons:
+   
+- [Boost](http://www.boost.org/) isn't installed. Boost is
+  available through most package management tools, so try that
+  first. You can also build it from source.
+
+- Boost is installed, but not in your path. The easiest solution is
+  to add the boost library directory to your `$LD_LIBRARY_PATH`
+  environment variable. You can also edit the file
+  `$JOSHUA/src/joshua/decoder/ff/lm/kenlm/Makefile` and define
+  `BOOST_ROOT` to point to your boost location. Then rebuild KenLM
+  with the command
+  
+      ant -f $JOSHUA/build.xml kenlm
+
+- You have run into boost's weird naming of multi-threaded
+  libraries. For some reason, boost libraries sometimes have a
+  `-mt` extension applied when they are built with multi-threaded
+  support. This will cause the linker to fail, since it is looking
+  for, e.g., `-lboost_system` instead of `-lboost_system-mt`. Edit
+  the same Makefile as above and uncomment the `BOOST_MT = -mt`
+  line, then try to compile again with
+  
+      ant -f $JOSHUA/build.xml kenlm
+
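To tell whether the `-mt` naming issue applies to you, you can look at the file names in your Boost library directory. The following is a minimal sketch, not part of Joshua: the helper name `needs_boost_mt` is hypothetical, and it assumes that Boost libraries follow the usual `libboost_<component>[-mt].<extension>` naming.

```python
from pathlib import Path


def needs_boost_mt(lib_dir, component="system"):
    """Return True if only the '-mt' variant of a Boost library is present.

    Assumes files are named libboost_<component>.<ext> or
    libboost_<component>-mt.<ext> (an assumption, not checked against Boost).
    """
    lib_dir = Path(lib_dir)
    plain = list(lib_dir.glob(f"libboost_{component}.*"))
    tagged = list(lib_dir.glob(f"libboost_{component}-mt.*"))
    return bool(tagged) and not plain
```

If `needs_boost_mt("/path/to/boost/lib")` returns `True`, uncommenting the `BOOST_MT = -mt` line is probably what you need.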
+You may find the following reference URLs to be useful.
+
+    https://groups.google.com/forum/#!topic/joshua_support/SiGO41tkpsw
+    
http://stackoverflow.com/questions/12583080/c-library-in-using-boost-library
+
+
+### How do I make Joshua produce better results?
+
+One way is to add a larger language model. Build on Gigaword, news
+crawl data, etc. `lmplz` makes it easy to build and efficient to
+represent (especially if you compress it with `build_binary`). To
+include it in Joshua, there are two ways:
+
+- *Pipeline*. By default, Joshua's pipeline builds a language
+   model on the target side of your parallel training data. But
+   Joshua can decode with any number of additional language models
+   as well. So you can build a language model separately,
+   presumably on much more data (since you won't be constrained
+   only to one side of parallel data, which is much more scarce
+   than monolingual data). Once you've built extra language models
+   and compiled them with KenLM's `build_binary` script, you can
+   tell the pipeline to use them with any number of `--lmfile
+   /path/to/lm/file` flags.
+
+- *Joshua* (directly).
+      [This file](file-formats.html)
+      documents the Joshua configuration file format.
+
+### I have already run the pipeline once. How do I run it again, skipping the 
early stages and just retuning the model?
+
+You would need to do this if, for example, you added a language
+model, or changed some other parameter (e.g., an improvement to the
+decoder). To do this, follow these steps:
+
+- Re-run the pipeline giving it a new `--rundir N+1` (where `N` is the last
+  run, and `N+1` is a new, non-existent directory). 
+- Give it all the other flags that you gave before, such as the
+  tuning data, testing data, source and target flags, etc. You
+  don't have to give it the training data.
+- Tell it to start at the tuning step with `--first-step TUNE`
+- Tell it where all of your language model files are with `--lmfile
+  /path/to/lm` lines. You also have to tell it where the main
+  language model is, which is usually `--lmfile N/lm.kenlm` (paths
+  are relative to the directory above the run directory).
+- Tell it where the main grammar is, e.g., `--grammar
+  N/grammar.gz`. If the tuning and test data hasn't changed, you
+  can also point it to the filtered and packed versions to save a
+  little time using `--tune-grammar N/data/tune/grammar.packed` and
+  `--test-grammar N/data/test/grammar.packed`, where `N` here again
+  is the previous run (or some other run; it can be anywhere).
+
+Here's an example. Let's say you ran a full pipeline as run 1, and
+now added a new language model and want to see how it affects the
+decoder. Your first run might have been invoked like this:
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --rundir 1 \
+      --readme "Baseline French--English Europarl hiero system" \
+      --corpus /path/to/europarl \
+      --tune /path/to/europarl/tune \
+      --test /path/to/europarl/test \
+      --source fr \
+      --target en \
+      --threads 8 \
+      --joshua-mem 30g \
+      --tuner mira \
+      --type hiero \
+      --aligner berkeley
+
+Your new run will look like this:
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --rundir 2 \
+      --readme "Adding in a huge language model" \
+      --tune /path/to/europarl/tune \
+      --test /path/to/europarl/test \
+      --source fr \
+      --target en \
+      --threads 8 \
+      --joshua-mem 30g \
+      --tuner mira \
+      --type hiero \
+      --aligner berkeley \
+      --first-step TUNE \
+      --lmfile 1/lm.kenlm \
+      --lmfile /path/to/huge/new/lm \
+      --tune-grammar 1/data/tune/grammar.packed \
+      --test-grammar 1/data/test/grammar.packed
+
+Notice the changes: we removed the `--corpus` (though it would have
+been fine to have left it, it would have just been skipped),
+specified the first step, changed the run directory and README
+comments, and pointed to the grammars and *both* language model files.
+
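The same changes can be written down as a small function. This is only an illustrative sketch: `retune_args` is a hypothetical helper, not part of the pipeline, and it uses only the flags and paths shown in the two commands above.

```python
def retune_args(previous_args, previous_run, new_run, readme, extra_lms):
    """Build the flags for a rerun that starts at the tuning step.

    previous_args is a list of (flag, value) pairs from the earlier run.
    """
    # These flags are either dropped (--corpus) or replaced for the new run.
    dropped = {"--corpus", "--rundir", "--readme"}
    args = [("--rundir", str(new_run)), ("--readme", readme)]
    args += [(flag, value) for flag, value in previous_args if flag not in dropped]
    args.append(("--first-step", "TUNE"))
    # The main language model from the previous run, then any additional ones.
    args.append(("--lmfile", f"{previous_run}/lm.kenlm"))
    args += [("--lmfile", path) for path in extra_lms]
    # Reuse the filtered and packed grammars from the previous run.
    args.append(("--tune-grammar", f"{previous_run}/data/tune/grammar.packed"))
    args.append(("--test-grammar", f"{previous_run}/data/test/grammar.packed"))
    return args
```

Reusing the packed grammars is only appropriate when the tuning and test data have not changed, as noted in the steps above.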
+### How can I enable specific feature functions?
+
+Let's say you created a new feature function, `OracleFeature`, and
+you want to enable it. You can do this in two ways. Through the
+pipeline, simply pass it the argument `--joshua-args "list of
+joshua args"`. These will then be passed to the decoder when it is
+invoked. You can then enable your feature function using
+something like
+
+    $JOSHUA/bin/pipeline.pl --joshua-args '-feature-function OracleFeature'   
+
+If you call the decoder directly, you can just put that line in
+the configuration file, e.g.,
+
+    feature-function = OracleFeature
+    
+or you can pass it directly to Joshua on the command line using
+the standard notation, e.g.,
+
+    $JOSHUA/bin/joshua-decoder -feature-function OracleFeature
+    
+These could be stacked, e.g.,
+    
+    $JOSHUA/bin/joshua-decoder -feature-function OracleFeature \
+        -feature-function MagicFeature \
+        -feature-function MTSolverFeature \
+        ...


http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/features.md
----------------------------------------------------------------------
diff --git a/6.0/features.md b/6.0/features.md
new file mode 100644
index 0000000..f9406a9
--- /dev/null
+++ b/6.0/features.md
@@ -0,0 +1,6 @@
+---
+layout: default6
+title: Features
+---
+
+Joshua 5.0 uses a sparse feature representation to encode features internally.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/file-formats.md
----------------------------------------------------------------------
diff --git a/6.0/file-formats.md b/6.0/file-formats.md
new file mode 100644
index 0000000..dbebe55
--- /dev/null
+++ b/6.0/file-formats.md
@@ -0,0 +1,72 @@
+---
+layout: default6
+category: advanced
+title: Joshua file formats
+---
+This page describes the formats of Joshua configuration and support files.
+
+## Translation models (grammars)
+
+Joshua supports two grammar file formats: a text-based version (also used by 
Hiero, shared by
+[cdec](), and supported by [hierarchical Moses]()), and an efficient
+[packed representation](packing.html) developed by [Juri 
Ganitkevich](http://cs.jhu.edu/~juri).
+
+Grammar rules follow this format.
+
+    [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES
+    
+The source and target sides contain a mixture of terminals and nonterminals. 
The nonterminals are
+linked across sides by indices. There is no limit to the number of paired 
nonterminals in the rule
+or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM 
grammars).
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17
+    [S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17
+    [VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 
0.81322956
+
+The feature values can have optional labels, e.g.:
+
+    [X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 
numwords=2 count=17
+    
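To make the rule format concrete, here is a minimal parsing sketch. It is not Joshua's own reader: `parse_rule` is a hypothetical helper, and keying unlabeled feature values by their position is an assumption made only for this illustration.

```python
def parse_rule(line):
    """Split one text-format grammar rule into its four fields."""
    lhs, source, target, features = [field.strip() for field in line.split("|||")]
    parsed = {}
    for index, token in enumerate(features.split()):
        if "=" in token:
            # Labeled feature, e.g. "lexprob=-3.14".
            name, value = token.split("=", 1)
        else:
            # Unlabeled feature: use its position as the key (illustrative choice).
            name, value = str(index), token
        parsed[name] = float(value)
    return {
        "lhs": lhs.strip("[]"),
        "source": source.split(),
        "target": target.split(),
        "features": parsed,
    }


rule = parse_rule("[X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17")
print(rule["lhs"], rule["source"], rule["features"]["lexprob"])
```

Nonterminals such as `[X,1]` are kept as plain tokens here; a real reader would also link them across the two sides by index.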
+One file common to decoding is the glue grammar, which for hiero grammar is 
defined as follows:
+
+    [GOAL] ||| <s> ||| <s> ||| 0
+    [GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1
+    [GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0
+
+Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT 
grammars via
+[Thrax](thrax.html) or GHKM grammars using [Michel 
Galley](http://www-nlp.stanford.edu/~mgalley/)'s
+GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).
+
+## Language Model
+
+Joshua has two language model implementations: 
[KenLM](http://kheafield.com/code/kenlm/) and
+[BerkeleyLM](http://berkeleylm.googlecode.com).  All language model 
implementations support the
+standard ARPA format output by 
[SRILM](http://www.speech.sri.com/projects/srilm/).  In addition,
+KenLM and BerkeleyLM support compiled formats that can be loaded more quickly 
and efficiently. KenLM
+is written in C++ and is supported via a JNI bridge, while BerkeleyLM is 
written in Java. KenLM is
+the default because of its support for left-state minimization.
+
+### Compiling for KenLM
+
+To compile an ARPA language model for KenLM, use the (provided) `build_binary`
+command, located deep within the Joshua source code:
+
+    $JOSHUA/bin/build_binary lm.arpa lm.kenlm
+    
+This script takes the `lm.arpa` file and produces the compiled version in 
`lm.kenlm`.
+
+### Compiling for BerkeleyLM
+
+To compile a language model for BerkeleyLM, type:
+
+    java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM 
edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm
+
+The `lm.berkeleylm` file can then be listed directly in the [Joshua 
configuration file](decoder.html).
+
+## Joshua configuration file
+
+The [decoder page](decoder.html) documents decoder command-line and config 
file options.
+
+## Thrax configuration
+
+See [the thrax page](thrax.html) for more information about the Thrax 
configuration file.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/index.md
----------------------------------------------------------------------
diff --git a/6.0/index.md b/6.0/index.md
new file mode 100644
index 0000000..898464a
--- /dev/null
+++ b/6.0/index.md
@@ -0,0 +1,24 @@
+---
+layout: default6
+title: Joshua documentation
+---
+
+This page contains end-user oriented documentation for the 6.0 release of
+[the Joshua decoder](http://joshua-decoder.org/).
+
+To navigate the documentation, use the links on the navigation bar to
+the left. For more detail on the decoder itself, including its command-line 
options, see
+[the Joshua decoder page](decoder.html).  You can also learn more about other 
steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar 
extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html).
+
+A [bundled configuration](bundle.html), which is a minimal set of 
configuration, resource, and script files, can be created and easily 
transferred and shared.
+
+## Development
+
+For developer support, please consult [the javadoc 
documentation](http://cs.jhu.edu/~post/joshua-docs) and the [Joshua developers 
mailing 
list](https://groups.google.com/forum/?fromgroups#!forum/joshua_developers).
+
+## Support
+
+If you have problems or issues, you might find some help [on our answers 
page](faq.html) or
+[in the mailing list 
archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/install.md
----------------------------------------------------------------------
diff --git a/6.0/install.md b/6.0/install.md
new file mode 100644
index 0000000..87e0079
--- /dev/null
+++ b/6.0/install.md
@@ -0,0 +1,88 @@
+---
+layout: default6
+title: Installation
+---
+
+### Download and install
+
+To use Joshua as a standalone decoder (with [language 
packs](/language-packs/)), you only need to download and install the runtime 
version of the decoder. 
+If you also wish to build translation models from your own data, you will want 
to install the full version.
+See the instructions below.
+
+1.  Set up some basic environment variables. 
+    You need to define `$JAVA_HOME`
+
+        export JAVA_HOME=/path/to/java
+
+        # JAVA_HOME is not very standardized. Here are some places to look:
+        # OS X:  export 
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home
+        # Linux: export JAVA_HOME=/usr/java/default
+
+1.  If you are installing the full version of Joshua, you also need to define 
`$HADOOP` to point to your Hadoop installation.
+    (Joshua looks for the Hadoop executable in `$HADOOP/bin/hadoop`.)
+
+        export HADOOP=/usr
+
+    If you don't have a Hadoop installation, [Joshua's 
pipeline](pipeline.html) can install a standalone version for you.
+    
+1.  To install just the runtime version of Joshua, type
+
+        wget -q http://cs.jhu.edu/~post/files/joshua-runtime-{{ 
site.data.joshua.release_version }}.tgz
+
+    Then build everything
+
+        tar xzf joshua-runtime-{{ site.data.joshua.release_version }}.tgz
+        cd joshua-runtime-{{ site.data.joshua.release_version }}
+
+        # Add this to your init files
+        export JOSHUA=$(pwd)
+       
+        # build everything
+        ant
+
+1.  To instead install the full version, type
+
+        wget -q http://cs.jhu.edu/~post/files/joshua-{{ 
site.data.joshua.release_version }}.tgz
+
+        tar xzf joshua-{{ site.data.joshua.release_version }}.tgz
+        cd joshua-{{ site.data.joshua.release_version }}
+
+        # Add this to your init files
+        export JOSHUA=$(pwd)
+       
+        # build everything
+        ant
+
+### Building new models
+
+If you wish to build models for new language pairs from existing data (such as 
the [WMT data](http://statmt.org/wmt14/)), you need to install some additional 
dependencies.
+
+1. For learning hierarchical models, Joshua includes a tool called 
[Thrax](thrax.html), which
+is built on Hadoop. If you have a Hadoop installation, make sure that the 
environment variable
+`$HADOOP` is set and points to it. If you don't, Joshua will roll one out for 
you in standalone
+mode. Hadoop is only needed if you plan to build new models with Joshua.
+
+1. You will need to install Moses if either of the following applies to you:
+
+    - You wish to build [phrase-based models](phrase.html) (Joshua 6 includes 
a phrase-based
+      decoder, but not the tools for building such a model)
+
+    - You are building your own models (phrase- or syntax-based) and wish to 
use Cherry & Foster's
+[batch MIRA tuner](http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf) 
instead of the included
+MERT implementation, [Z-MERT](zmert.html). 
+
+    Follow [the instructions for installing Moses
+here](http://www.statmt.org/moses/?n=Development.GetStarted), and then define 
the `$MOSES`
+environment variable to point to the root of the Moses installation.
+
+## More information
+
+For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html).  You can also learn more about other 
steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar 
extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html).
+
+If you have problems or issues, you might find some help [on our answers 
page](faq.html) or
+[in the mailing list 
archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).
+
+A [bundled configuration](bundle.html), which is a minimal set of 
configuration, resource, and script files, can be created and easily 
transferred and shared.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/jacana.md
----------------------------------------------------------------------
diff --git a/6.0/jacana.md b/6.0/jacana.md
new file mode 100644
index 0000000..71c1753
--- /dev/null
+++ b/6.0/jacana.md
@@ -0,0 +1,139 @@
+---
+layout: default6
+title: Alignment with Jacana
+---
+
+## Introduction
+
+jacana-xy is a token-based word aligner for machine translation, adapted from 
the original
+English-English word aligner jacana-align described in the following paper:
+
+    A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, 
Benjamin Van Durme,
+    Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short 
papers.
+
+It currently supports only aligning from French to English with a very limited 
feature set, from the
+one week hack at the [Eighth MT Marathon 2013](http://statmt.org/mtm13). 
Please feel free to check
+out the code, read to the bottom of this page, and
+[send the author an email](http://www.cs.jhu.edu/~xuchen/) if you want to add 
more language pairs to
+it.
+
+## Build
+
+jacana-xy is written in a mixture of Java and Scala. If you build from ant, 
you have to set up the
+environmental variables `JAVA_HOME` and `SCALA_HOME`. In my system, I have:
+
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
+    export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2
+
+Then type:
+
+    ant
+
+build/lib/jacana-xy.jar will be built for you.
+
+If you build from Eclipse, first install scala-ide, then import the whole 
jacana folder as a Scala project. Eclipse should find the .project file and set 
up the project automatically for you.
+
+## Demo
+
+`scripts-align/runDemoServer.sh` starts the web demo. Direct your browser to
+http://localhost:8080/ and you should be able to align some sentences.
+
+Note: To tell jacana-xy where to look for resource files, pass the
+property `JACANA_HOME` to Java when you run it:
+
+    java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ......
+
+## Browser
+
+You can also browse one or two alignment files (`*.json`) by opening
+`src/web/AlignmentBrowser.html` in Firefox.
+
+Note 1: due to strict security settings for accessing local files, Chrome/IE
+won't work.
+
+Note 2: the input `*.json` files have to be in the same folder as
+`AlignmentBrowser.html`.
+
+## Align
+
+`scripts-align/alignFile.sh` aligns tab-separated sentence files and writes the
+result to a `.json` file that's accepted by the browser:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json
+
+`scripts-align/alignFile.sh` takes GIZA++-style input files (one file containing
+the source sentences, and the other file the target sentences) and writes
+one `.align` file with dashed alignment indices (e.g. "1-2 0-4"):
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
+
+## Training
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model
+
+The aligner trains on `train.json` and reports F1 on `dev.json` every 10
+iterations. When the stopping criterion is reached, it tests on `test.json`.
+
+Every 10 iterations, a model file is saved to (in this example)
+`/tmp/align.model.iter_XX.F1_XX.X`. Normally what I do is select the one with
+the best F1 on `dev.json`, then run a final test on `test.json`:
+
+    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X
+
+In this case, since the training data is missing, the aligner assumes it's a
+test job, reads the model file from the `-m` option, and tests on `test.json`.
+
+All the json files are in a format like the following (also accepted by the 
browser for display):
+
+    [
+        {
+            "id": "0008",
+            "name": "Hansards.french-english.0008",
+            "possibleAlign": "0-0 0-1 0-2",
+            "source": "bravo !",
+            "sureAlign": "1-3",
+            "target": "hear , hear !"
+        },
+        {
+            "id": "0009",
+            "name": "Hansards.french-english.0009",
+            "possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
+            "source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
+            "sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
+            "target": "Mr. Speaker , my question is directed to the Minister of Transport ."
+        }
+    ]
+
+Where possibleAlign is not used.
+
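As a reading aid for this format, the sketch below turns the `sureAlign` links of an entry into word pairs. It is not part of jacana-xy: `sure_pairs` is a hypothetical helper, and it assumes each link is a zero-based `source-target` token index pair, which is consistent with the two entries shown above.

```python
import json

SAMPLE = """
[
    {
        "id": "0008",
        "name": "Hansards.french-english.0008",
        "possibleAlign": "0-0 0-1 0-2",
        "source": "bravo !",
        "sureAlign": "1-3",
        "target": "hear , hear !"
    }
]
"""


def sure_pairs(entry):
    """Return (source_word, target_word) tuples for the sureAlign links."""
    source = entry["source"].split()
    target = entry["target"].split()
    pairs = []
    for link in entry["sureAlign"].split():
        # Each link is "<source index>-<target index>" (assumed zero-based).
        i, j = (int(index) for index in link.split("-"))
        pairs.append((source[i], target[j]))
    return pairs


for entry in json.loads(SAMPLE):
    print(entry["id"], sure_pairs(entry))
```

For entry `0008`, the single sure link `1-3` pairs the source `!` with the final `!` of the target.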
+The stopping criterion is to run up to 300 iterations or when the objective 
difference between two iterations is less than 0.001, whichever happens first. 
Currently they are hard-coded. If you need to be flexible on this, send me an 
email!
+
+## Support More Languages
+
+To add support for more languages, you need:
+
+- labelled word alignments (in the download there's already French-English under
+  `alignment-data/fr-en`; I also have Chinese-English and Arabic-English; let me
+  know if you have more). Usually 100 labelled sentence pairs would be enough.
+- some feature functions implemented for this language pair.
+
+To add more features, you need to implement the following interface:
+
+    edu.jhu.jacana.align.feature.AlignFeature
+
+and override the following function:
+
+    addPhraseBasedFeature
+
+For instance, a simple feature that checks whether the two words are
+translations in Wiktionary for the French-English alignment task has the
+function implemented as:
+
+    def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int,
+          srcSpan: Int, j: Int, tgtSpan: Int,
+          currState: Int, featureAlphabet: Alphabet) {
+      // Nothing is added when j == -1.
+      if (j != -1) {
+        // Join the source and target tokens covered by each span.
+        val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
+        val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")
+
+        // Add the feature if Wiktionary lists the two phrases as translations.
+        if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
+          ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
+        }
+      }
+    }
+
+This is a more general function that also deals with phrase alignment. However, it
+is suggested that you implement it just for token alignment, as the phrase
+alignment part is currently very slow to train (60x slower than token alignment).
+
+Some other language-independent and English-only features are implemented
+under the package `edu.jhu.jacana.align.feature`, for instance:
+
+- `StringSimilarityAlignFeature`: various string similarity measures
+- `PositionalAlignFeature`: features based on relative sentence positions
+- `DistortionAlignFeature`: Markovian (state transition) features
+
+When you add features for more languages, just create a new package like the
+one for French-English:
+
+    edu.jhu.jacana.align.feature.fr_en
+
+and start coding!
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/large-lms.md
----------------------------------------------------------------------
diff --git a/6.0/large-lms.md b/6.0/large-lms.md
new file mode 100644
index 0000000..a6792dd
--- /dev/null
+++ b/6.0/large-lms.md
@@ -0,0 +1,192 @@
+---
+layout: default6
+title: Building large LMs with SRILM
+category: advanced
+---
+
+The following is a tutorial for building a large language model from the
+English Gigaword Fifth Edition corpus
+[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
+using SRILM. English text is provided from seven different sources.
+
+### Step 0: Clean up the corpus
+
+The Gigaword corpus has to be stripped of all SGML tags and tokenized.
+Instructions for performing those steps are not included in this
+documentation. A description of this process can be found in a paper
+called ["Annotated
+Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).
+
+The Joshua package ships with a script that converts all alphabetical
+characters to their lowercase equivalent. The script is located at
+`$JOSHUA/scripts/lowercase.perl`.
+
+Make a directory structure as follows:
+
+    gigaword/
+    ├── corpus/
+    │   ├── afp_eng/
+    │   │   ├── afp_eng_199405.lc.gz
+    │   │   ├── afp_eng_199406.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── apw_eng/
+    │   │   ├── apw_eng_199411.lc.gz
+    │   │   ├── apw_eng_199412.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── cna_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── ltw_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── nyt_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── wpb_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   └── xin_eng/
+    │       ├── ...
+    │       └── counts/
+    └── lm/
+        ├── afp_eng/
+        ├── apw_eng/
+        ├── cna_eng/
+        ├── ltw_eng/
+        ├── nyt_eng/
+        ├── wpb_eng/
+        └── xin_eng/
+
+
+The next step will be to build smaller LMs and then interpolate them into one
+file.
+
+### Step 1: Count ngrams
+
+Run the following script once from each source directory under the `corpus/`
+directory (edit it to specify the path to the `ngram-count` binary as well as
+the number of processors):
+
+    #!/bin/sh
+
+    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
+    args=""
+
+    for source in *.gz; do
+       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz 
"
+    done
+
+    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT
+
+Then move each `counts/` directory to the corresponding directory under
+`lm/`. Now that each ngram has been counted, we can make a language
+model for each of the seven sources.
+
+### Step 2: Make individual language models
+
+SRILM includes a script, called `make-big-lm`, for building large language
+models under resource-limited environments. The manual for this script can be
+read online
+[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
+Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
+even in environments with many parallel processors and a lot of memory.
+
+Initiate the following script from each of the source directories under the
+`lm/` directory (edit it to specify the path to the `make-big-lm` script as
+well as the pruning threshold):
+
+    #!/bin/bash
+    set -x
+
+    CMD=$SRILM_SRC/bin/make-big-lm
+    PRUNE_THRESHOLD=1e-8
+
+    $CMD \
+      -name gigalm `for k in counts/*.gz; do echo " \
+      -read $k "; done` \
+      -lm lm.gz \
+      -max-per-file 100000000 \
+      -order 5 \
+      -kndiscount \
+      -interpolate \
+      -unk \
+      -prune $PRUNE_THRESHOLD
+
+The language model attributes chosen are the following:
+
+* N-grams up to order 5
+* Kneser-Ney smoothing
+* N-gram probability estimates at the specified order *n* are interpolated with
+  lower-order estimates
+* include the unknown-word token as a regular word
+* pruning N-grams based on the specified threshold
+
+Next, we will mix the models together into a single file.
+
+### Step 3: Mix models together
+
+Using development text, interpolation weights can be determined that give the highest
+weight to the source language models that have the lowest perplexity on the
+specified development set.
+
+#### Step 3-1: Determine interpolation weights
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the path to the development text file):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
+
+    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+
+    for d in ${dirs[@]} ; do
+      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
+    done
+
+    compute-best-mix */lm.ppl > best-mix.ppl
+
+Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
+values in parentheses. These are the interpolation weights of the source
+language models in the order specified. Copy and paste the values within the
+parentheses into the script below.
+
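To see what these weights do, the sketch below shows the arithmetic of linear interpolation for a single event: each model's probability is multiplied by its weight and the results are summed. It is an illustration only, not how SRILM's `ngram` is implemented; `interpolate` is a hypothetical helper, and the weights are the example values from the script in Step 3-2.

```python
def interpolate(probabilities, lambdas):
    """Linearly interpolate per-model probabilities of the same event."""
    if len(probabilities) != len(lambdas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(lambdas) - 1.0) > 1e-3:
        raise ValueError("interpolation weights should sum to 1")
    return sum(weight * p for weight, p in zip(lambdas, probabilities))


# Example weights for afp, apw, cna, ltw, nyt, wpb, xin (from Step 3-2).
LAMBDAS = [0.00631272, 0.000647602, 0.251555, 0.0134726, 0.348953, 0.371566, 0.00749238]

# Hypothetical per-model probabilities for one word in one context.
print(interpolate([0.01, 0.02, 0.03, 0.01, 0.05, 0.04, 0.02], LAMBDAS))
```

With these example weights, the `nyt_eng` and `wpb_eng` models dominate the mixture, while `apw_eng` contributes almost nothing.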
+#### Step 3-2: Combine the models
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the interpolation weights):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DIRS=(   afp_eng    apw_eng     cna_eng  ltw_eng   nyt_eng  wpb_eng  
xin_eng )
+    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 
0.00749238)
+
+    $NGRAM -order 5 -unk \
+      -lm      ${DIRS[0]}/lm.gz     -lambda  ${LAMBDAS[0]} \
+      -mix-lm  ${DIRS[1]}/lm.gz \
+      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
+      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
+      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
+      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
+      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
+      -write-lm mixed_lm.gz
+
+The resulting file, `mixed_lm.gz`, is a language model based on all the text in
+the Gigaword corpus, with some probabilities biased toward the development text
+specified in step 3-1. It is in the ARPA format. The optional next step converts
+it into KenLM format.
+
+#### Step 3-3: Convert to KenLM
+
+The KenLM format has some speed advantages over the ARPA format. Issuing the
+following command will write a new language model file, `mixed_lm.kenlm`, that
+is the `mixed_lm.gz` language model transformed into the KenLM format.
+
+    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz 
mixed_lm.kenlm
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/packing.md
----------------------------------------------------------------------
diff --git a/6.0/packing.md b/6.0/packing.md
new file mode 100644
index 0000000..8d84004
--- /dev/null
+++ b/6.0/packing.md
@@ -0,0 +1,74 @@
+---
+layout: default6
+category: advanced
+title: Grammar Packing
+---
+
+Grammar packing refers to the process of taking a textual grammar
+output by [Thrax](thrax.html) (or Moses, for phrase-based models) and
+efficiently encoding it so that it can be loaded
+[very quickly](https://aclweb.org/anthology/W/W12/W12-3134.pdf) ---
+packing the grammar results in significantly faster load times for
+very large grammars.  Packing is done automatically by the
+[Joshua pipeline](pipeline.html), but you can also run the packer
+manually.
+
+The script can be found at
+`$JOSHUA/scripts/support/grammar-packer.pl`. See that script for
+example usage. You can then add it to a Joshua config file, simply
+replacing a `tm` path to the compressed text-file format with a path
+to the packed grammar directory (Joshua will automatically detect that
+it is packed, since a packed grammar is a directory).
+
+Packing the grammar requires first sorting it by the rules' source side,
+which can take quite a bit of temporary space.
+
+*CAVEAT*: You may run into problems packing very very large Hiero
+ grammars. Email the support list if you do.
+
+### Examples
+
+A Hiero grammar, using the compressed text file version:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar.filtered.gz
+
+Pack it:
+
+    $JOSHUA/scripts/support/grammar-packer.pl grammar.filtered.gz grammar.packed
+
+Pack a really big grammar:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -m 30g grammar.filtered.gz grammar.packed
+
+Be a little more verbose:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -m 30g grammar.filtered.gz grammar.packed
+
+You have a different temp file location:
+
+    $JOSHUA/scripts/support/grammar-packer.pl -T /local grammar.filtered.gz grammar.packed
+
+Update the config file line:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar.packed
+
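+The config update above is a one-token change, which the following Python sketch illustrates.
+It is not part of Joshua: the helper names `set_tm_path` and `looks_packed` are made up for this
+example, and the directory test only mirrors the statement that a packed grammar is a directory.
+
```python
import os


def set_tm_path(tm_line, new_path):
    """Return the `tm` config line with the argument of `-path` replaced by new_path."""
    tokens = tm_line.split()
    idx = tokens.index("-path")  # raises ValueError if the line has no -path argument
    tokens[idx + 1] = new_path
    return " ".join(tokens)


def looks_packed(tm_path):
    """Guess whether a grammar path is packed: packed grammars are directories."""
    return os.path.isdir(tm_path)
```
+
+Calling `set_tm_path` on the first `tm` line in this section with `grammar.packed` yields the
+updated line shown above.
+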
+### Using multiple packed grammars (Joshua 6.0.5)
+
+Packed grammars serialize their vocabularies, which previously prevented the use of multiple
+packed grammars during decoding. As of Joshua 6.0.5, multiple packed grammars can be used
+during decoding, provided they share the same serialized vocabulary.
+This is achieved by packing these grammars jointly using a revised packing CLI.
+
+To pack multiple grammars:
+
+    $JOSHUA/scripts/support/grammar-packer.pl grammar1.filtered.gz grammar2.filtered.gz [...] grammar1.packed grammar2.packed [...]
+
+This will produce two packed grammars with the same vocabulary. To use them in the decoder, put
+this in your `joshua.config`:
+
+    tm = hiero -owner pt -maxspan 20 -path grammar1.packed
+    tm = hiero -owner pt2 -maxspan 20 -path grammar2.packed
+
+Note the different owners.
+If you are trying to load multiple packed grammars that do not have the same
+vocabulary, the decoder will throw a RuntimeException at loading time:
+
+    Exception in thread "main" java.lang.RuntimeException: Trying to load multiple packed grammars with different vocabularies! Have you packed them jointly?

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/pipeline.md
----------------------------------------------------------------------
diff --git a/6.0/pipeline.md b/6.0/pipeline.md
new file mode 100644
index 0000000..4389435
--- /dev/null
+++ b/6.0/pipeline.md
@@ -0,0 +1,666 @@
+---
+layout: default6
+category: links
+title: The Joshua Pipeline
+---
+
+*Please note that Joshua 6.0.3 included some big changes to the directory organization of the
+ pipeline's files.*
+
+This page describes the Joshua pipeline script, which manages the complexity 
of training and
+evaluating machine translation systems.  The pipeline eases the pain of two 
related tasks in
+statistical machine translation (SMT) research:
+
+- Training SMT systems involves a complicated process of interacting steps 
that are
+  time-consuming and prone to failure.
+
+- Developing and testing new techniques requires varying parameters at 
different points in the
+  pipeline. Earlier results (which are often expensive) need not be recomputed.
+
+To facilitate these tasks, the pipeline script:
+
+- Runs the complete SMT pipeline, from corpus normalization and tokenization, 
through alignment,
+  model building, tuning, test-set decoding, and evaluation.
+
+- Caches the results of intermediate steps (using robust SHA-1 checksums on 
dependencies), so the
+  pipeline can be debugged or shared across similar runs while doing away with 
time spent
+  recomputing expensive steps.
+ 
+- Allows you to jump into and out of the pipeline at a set of predefined 
places (e.g., the alignment
+  stage), so long as you provide the missing dependencies.
+
+The Joshua pipeline script is designed in the spirit of Moses' 
`train-model.pl`, and shares
+(and has borrowed) many of its features.  It is not as extensive as Moses'
+[Experiment Management 
System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows
+the user to define arbitrary execution dependency graphs. However, it is 
significantly simpler to
+use, allowing many systems to be built with a single command (that may run for 
days or weeks).
+
+## Dependencies
+
+The pipeline has no *required* external dependencies.  However, it has support 
for a number of
+external packages, some of which are included with Joshua.
+
+-  [GIZA++](http://code.google.com/p/giza-pp/) (included)
+
+   GIZA++ is the default aligner.  It is included with Joshua, and should compile successfully when
+   you type `ant` from the Joshua root directory.  It is not required because you can use the
+   (included) Berkeley aligner (`--aligner berkeley`). We have recently also provided support
+   for the [Jacana-XY aligner](http://code.google.com/p/jacana-xy/wiki/JacanaXY)
+   (`--aligner jacana`).
+
+-  [Hadoop](http://hadoop.apache.org/) (included)
+
+   The pipeline uses the [Thrax grammar extractor](thrax.html), which is built 
on Hadoop.  If you
+   have a Hadoop installation, simply ensure that the `$HADOOP` environment 
variable is defined, and
+   the pipeline will use it automatically at the grammar extraction step.  If 
you are going to
+   attempt to extract very large grammars, it is best to have a good-sized 
Hadoop installation.
+   
+   (If you do not have a Hadoop installation, you might consider setting one 
up.  Hadoop can be
+   installed in a
+   
["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed)
+   mode that allows it to use just a few machines or a number of processors on 
a single machine.
+   The main issue is to ensure that there are a lot of independent physical 
disks, since in our
+   experience Hadoop starts to exhibit lots of hard-to-trace problems if there 
is too much demand on
+   the disks.)
+   
+   If you don't have a Hadoop installation, there are still no worries.  The 
pipeline will unroll a
+   standalone installation and use it to extract your grammar.  This behavior 
will be triggered if
+   `$HADOOP` is undefined.
+   
+-  [Moses](http://statmt.org/moses/) (not included). Moses is needed
+   if you wish to use its 'kbmira' tuner (`--tuner kbmira`), or if you
+   wish to build phrase-based models.
+   
+-  [SRILM](http://www.speech.sri.com/projects/srilm/) (not included; not 
needed; not recommended)
+
+   By default, the pipeline uses the included 
[KenLM](https://kheafield.com/code/kenlm/) for
+   building (and also querying) language models. Joshua also includes a Java 
program from the
+   [Berkeley LM](http://code.google.com/p/berkeleylm/) package that contains 
code for constructing a
+   Kneser-Ney-smoothed language model in ARPA format from the target side of 
your training data.  
+   There is no need to use SRILM, but if you do wish to use it, you need to do 
the following:
+   
+   1. Install SRILM and set the `$SRILM` environment variable to point to its 
installed location.
+   1. Add the `--lm-gen srilm` flag to your pipeline invocation.
+   
+   More information on this is available in the [LM building section of the 
pipeline](#lm).  SRILM
+   is not used for representing language models during decoding (and in fact 
is not supported,
+   having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the 
default) and
+   BerkeleyLM).
+
+After installing any dependencies, follow the brief instructions on
+the [installation page](install.html), and then you are ready to build
+models. 
+
+## A basic pipeline run
+
+The pipeline takes a set of inputs (training, tuning, and test data), and 
creates a set of
+intermediate files in the *run directory*.  By default, the run directory is 
the current directory,
+but it can be changed with the `--rundir` parameter.
+
+For this quick start, we will be working with the example that can be found in
+`$JOSHUA/examples/training`.  This example contains 1,000 sentences of Urdu-English data (the full
+dataset is available as part of the
+[Indian languages parallel corpora](/indian-parallel-corpora/)), with
+100-sentence tuning and test sets with four references each.
+
+Running the pipeline requires two main steps: data preparation and invocation.
+
+1. Prepare your data.  The pipeline script needs to be told where to find the raw training, tuning,
+   and test data.  A good convention is to place these files in an `input/` subdirectory of your run's
+   working directory (NOTE: do not use `data/`, since a directory of that name is created and used
+   by the pipeline itself for storing processed files).  The expected format (for each of training,
+   tuning, and test) is a pair of files that share a common path prefix and are distinguished by
+   their extension, e.g.,
+
+       input/
+             train.SOURCE
+             train.TARGET
+             tune.SOURCE
+             tune.TARGET
+             test.SOURCE
+             test.TARGET
+
+   These files should be parallel at the sentence level (with one sentence per line), should be in
+   UTF-8, and should be untokenized (tokenization occurs in the pipeline).  SOURCE and TARGET denote
+   variables that should be replaced with the actual source and target language abbreviations (e.g.,
+   "ur" and "en").
+   
+1. Run the pipeline.  The following is the minimal invocation to run the complete pipeline:
+
+       $JOSHUA/bin/pipeline.pl  \
+         --rundir .             \
+         --type hiero           \
+         --corpus input/train   \
+         --tune input/tune      \
+         --test input/test      \
+         --source SOURCE        \
+         --target TARGET
+
+   The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatenated with the
+   language extensions given by `--source` and `--target` (with a "." in between).  Note the
+   correspondences with the files defined in the first step above.  The prefixes can be either
+   absolute or relative pathnames.  With SOURCE set to "ur" and TARGET set to "en", this
+   invocation assumes that a subdirectory `input/` exists in the current directory, that you are
+   translating from a language identified by the "ur" extension to a language identified by the
+   "en" extension, that the training data can be found at `input/train.ur` and `input/train.en`,
+   and so on.
+
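+The way prefixes and language extensions combine can be sketched in a few lines of Python. This
+is an illustration only, not code from the pipeline; the function name `expected_inputs` is made
+up, and the sketch ignores the numbered suffixes used when a tuning or test set has multiple
+references.
+
```python
def expected_inputs(prefixes, source, target):
    """Return the file paths implied by each prefix and the two language extensions."""
    paths = []
    for prefix in prefixes:
        for ext in (source, target):
            # A "." is placed between the prefix and the language extension.
            paths.append(f"{prefix}.{ext}")
    return paths
```
+
+For the prefixes `input/train`, `input/tune`, and `input/test` with "ur" and "en", this lists
+`input/train.ur`, `input/train.en`, and so on, matching the layout from the first step.
+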
+*Don't* run the pipeline directly from `$JOSHUA`, or, for that matter, in any 
directory with lots of other files.
+This can cause problems because the pipeline creates lots of files under 
`--rundir` that can clobber existing files.
+You should run experiments in a clean directory.
+For example, if you have Joshua installed in `$HOME/code/joshua`, manage your 
runs in a different location, such as `$HOME/expts/joshua`.
+
+Assuming no problems arise, this command will run the complete pipeline in 
about 20 minutes,
+producing BLEU scores at the end.  As it runs, you will see output that looks 
like the following:
+   
+    [train-copy-en] rebuilding...
+      dep=/Users/post/code/joshua/test/pipeline/input/train.en 
+      dep=data/train/train.en.gz [NOT FOUND]
+      cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz
+      took 0 seconds (0s)
+    [train-copy-ur] rebuilding...
+      dep=/Users/post/code/joshua/test/pipeline/input/train.ur 
+      dep=data/train/train.ur.gz [NOT FOUND]
+      cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz
+      took 0 seconds (0s)
+    ...
+   
+And in the current directory, you will see the following files (among
+other files, including intermediate files
+generated by the individual sub-steps).
+   
+    data/
+        train/
+            corpus.ur
+            corpus.en
+            thrax-input-file
+        tune/
+            corpus.ur -> tune.tok.lc.ur
+            corpus.en -> tune.tok.lc.en
+            grammar.filtered.gz
+            grammar.glue
+        test/
+            corpus.ur -> test.tok.lc.ur
+            corpus.en -> test.tok.lc.en
+            grammar.filtered.gz
+            grammar.glue
+    alignments/
+        0/
+            [giza/berkeley aligner output files]
+        1/
+        ...
+        training.align
+    thrax-hiero.conf
+    thrax.log
+    grammar.gz
+    lm.gz
+    tune/
+         decoder_command
+         model/
+               [model files]
+         params.txt
+         joshua.log
+         mert.log
+         joshua.config.final
+         final-bleu
+    test/
+         model/
+               [model files]
+         output
+         final-bleu
+
+These files will be described in more detail in subsequent sections of this 
tutorial.
+
+Another useful flag is the `--rundir DIR` flag, which chdir()s to the 
specified directory before
+running the pipeline.  By default the rundir is the current directory.  
Changing it can be useful
+for organizing related pipeline runs.  In fact, we highly recommend
+that you organize your runs using consecutive integers, also taking a
+minute to pass a short note with the `--readme` flag, which allows you
+to quickly generate reports on [groups of related experiments](#managing).
+Relative paths specified to other flags (e.g., to `--corpus`
+or `--lmfile`) are relative to the directory the pipeline was called *from*, 
not the rundir itself
+(unless they happen to be the same, of course).
+
+The complete pipeline comprises many tens of small steps, which can be grouped 
together into a set
+of traditional pipeline tasks:
+   
+1. [Data preparation](#prep)
+1. [Alignment](#alignment)
+1. [Parsing](#parsing) (syntax-based grammars only)
+1. [Grammar extraction](#tm)
+1. [Language model building](#lm)
+1. [Tuning](#tuning)
+1. [Testing](#testing)
+1. [Analysis](#analysis)
+
+These steps are discussed below, after a few intervening sections about 
high-level details of the
+pipeline.
+
+## <a id="managing" /> Managing groups of experiments
+
+The real utility of the pipeline comes when you use it to manage groups of 
experiments. Typically,
+there is a held-out test set, and we want to vary a number of training 
parameters to determine what
+effect this has on BLEU scores or some other metric. Joshua comes with a script
+`$JOSHUA/scripts/training/summarize.pl` that collects information from a group 
of runs and reports
+them to you. This script works so long as you organize your runs as follows:
+
+1. Your runs should be grouped together in a root directory, which I'll call 
`$EXPDIR`.
+
+2. For comparison purposes, the runs should all be evaluated on the same test 
set.
+
+3. Each run in the run group should be in its own numbered directory, shown 
with the files used by
+the summarize script:
+
+       $EXPDIR/
+           1/
+               README.txt
+               test/
+                   final-bleu
+                   final-times
+               [other files]
+           2/
+               README.txt
+               test/
+                   final-bleu
+                   final-times
+               [other files]
+               ...
+               
+You can get such directories using the `--rundir N` flag to the pipeline. 
+
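+To make the expected layout concrete, here is a small Python sketch that walks such a root
+directory and gathers the note and score for each numbered run. It is not the `summarize.pl`
+script and makes no claim about that script's output; the function name `collect_runs` is
+hypothetical, and the contents of `final-bleu` are simply read as text because their exact
+format is not described here.
+
```python
import os


def collect_runs(expdir):
    """Collect (run number, README note, BLEU text) for each numbered run directory."""
    rows = []
    for name in os.listdir(expdir):
        bleu_path = os.path.join(expdir, name, "test", "final-bleu")
        # Only numbered directories with a finished test step are considered.
        if not name.isdigit() or not os.path.isfile(bleu_path):
            continue
        readme_path = os.path.join(expdir, name, "README.txt")
        note = ""
        if os.path.isfile(readme_path):
            with open(readme_path, encoding="utf-8") as f:
                note = f.read().strip()
        with open(bleu_path, encoding="utf-8") as f:
            bleu = f.read().strip()
        rows.append((int(name), note, bleu))
    # Sort numerically so that run 10 follows run 2.
    return sorted(rows)
```
+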
+Run directories can build off each other. For example, `1/` might contain a complete baseline
+run. If you want to change just the tuner, you don't need to rerun the aligner and model builder,
+so you can reuse the results by supplying the second run with the information it needs that was
+computed in run `1/`:
+
+    $JOSHUA/bin/pipeline.pl \
+      --first-step tune \
+      --grammar 1/grammar.gz \
+      ...
+      
+More details are below.
+
+## Grammar options
+
+Hierarchical Joshua can extract three types of grammars: Hiero
+grammars, GHKM, and SAMT grammars.  As described on the
+[file formats page](file-formats.html), all of them are encoded into
+the same file format, but they differ in terms of the richness of
+their nonterminal sets.
+
+Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
+word-based alignments and then subtracting out phrase differences.  More detail can be found in
+[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201).
+[GHKM](http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf) (new with 5.0) 
and
+[SAMT](http://www.cs.cmu.edu/~zollmann/samt/) grammars make use of a source- 
or target-side parse
+tree on the training data, differing in the way they extract rules using these 
trees: GHKM extracts
+synchronous tree substitution grammar rules rooted in a subset of the tree 
constituents, whereas
+SAMT projects constituent labels down onto phrases.  SAMT grammars are usually 
many times larger and
+are much slower to decode with, but sometimes increase BLEU score.  Both 
grammar formats are
+extracted with the [Thrax software](thrax.html).
+
+By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the
+`--type (ghkm|samt)` flag. For GHKM grammars, the default is to use
+[Michel Galley's extractor](http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz),
+but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's extractor only outputs
+two features, so the scores tend to be significantly lower than those of Moses'.
+
+Joshua (new in version 6) also includes an unlexicalized phrase-based
+decoder. Building a phrase-based model requires you to have Moses
+installed, since its `train-model.perl` script is used to extract the
+phrase table. You can enable this by defining the `$MOSES` environment
+variable and then specifying `--type phrase`.
+
+## Other high-level options
+
+The following command-line arguments control run-time behavior of multiple 
steps:
+
+- `--threads N` (1)
+
+  This enables multithreaded operation for a number of steps: alignment (with 
GIZA, max two
+  threads), parsing, and decoding (any number of threads)
+  
+- `--jobs N` (1)
+
+  This enables parallel operation over a cluster using the qsub command.  This feature is not
+  well-documented at this point, but you will likely want to edit the file
+  `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to set up your qsub environment, and may also
+  want to pass specific qsub arguments via the `--qsub-args "ARGS"`
+  flag. We suggest you stick to the standard Joshua model that
+  tries to use as many cores as are available with the `--threads N` option.
+
+## Restarting failed runs
+
+If the pipeline dies, you can restart it with the same command you used the 
first time.  If you
+rerun the pipeline with the exact same invocation as the previous run (or an 
overlapping
+configuration -- one that causes the same set of behaviors), you will see 
slightly different
+output compared to what we saw above:
+
+    [train-copy-en] cached, skipping...
+    [train-copy-ur] cached, skipping...
+    ...
+
+This indicates that the caching module has discovered that the step was 
already computed and thus
+did not need to be rerun.  This feature is quite useful for restarting 
pipeline runs that have
+crashed due to bugs, memory limitations, hardware failures, and the myriad 
other problems that
+plague MT researchers across the world.
+
+Often, a command will die because it was parameterized incorrectly.  For example, perhaps the
+decoder ran out of memory.  Caching allows you to adjust the parameter (e.g., `--joshua-mem`) and
+rerun the script without repeating the steps that already succeeded.  Of course, if you change
+one of the parameters a step depends on, it will trigger a rerun, which in turn might trigger
+further downstream reruns.
+   
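+The idea behind checksum-based caching can be sketched as follows. This is a simplified Python
+illustration, not the pipeline's caching module: the function names, the in-memory `cache`
+dictionary, and the choice to hash the command string together with the dependency contents are
+all assumptions made for the example.
+
```python
import hashlib
import os


def signature(cmd, deps):
    """SHA-1 over the command string and the contents of every dependency file."""
    sha = hashlib.sha1()
    sha.update(cmd.encode("utf-8"))
    for dep in deps:
        if not os.path.exists(dep):
            return None  # a missing dependency always forces a rebuild
        with open(dep, "rb") as f:
            sha.update(f.read())
    return sha.hexdigest()


def is_cached(name, cmd, deps, cache):
    """True if step `name` was recorded with the same command and dependency contents."""
    sig = signature(cmd, deps)
    return sig is not None and cache.get(name) == sig


def record(name, cmd, deps, cache):
    """Remember the signature of step `name` after it has completed successfully."""
    cache[name] = signature(cmd, deps)
```
+
+A step is skipped only when `is_cached` returns true; changing a dependency file or the command
+changes the signature, which is what makes a rerun propagate to downstream steps.
+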
+## <a id="steps" /> Skipping steps, quitting early
+
+You will also find it useful to start the pipeline somewhere other than data 
preparation (for
+example, if you have already-processed data and an alignment, and want to 
begin with building a
+grammar) or to end it prematurely (if, say, you don't have a test set and just 
want to tune a
+model).  This can be accomplished with the `--first-step` and `--last-step` 
flags, which take as
+argument a case-insensitive version of the following steps:
+
+- *FIRST*: Data preparation.  Everything begins with data preparation.  This 
is the default first
+   step, so there is no need to be explicit about it.
+
+- *ALIGN*: Alignment.  You might want to start here if you want to skip data 
preprocessing.
+
+- *PARSE*: Parsing.  This is only relevant for building SAMT and GHKM grammars (`--type samt` or
+   `--type ghkm`), in which case the target side (`--target`) of the training data (`--corpus`) is
+   parsed before building a grammar.
+
+- *THRAX*: Grammar extraction [with Thrax](thrax.html).  If you jump to this 
step, you'll need to
+   provide an aligned corpus (`--alignment`) along with your parallel data.  
+
+- *TUNE*: Tuning.  The exact tuning method is determined with `--tuner 
{mert,mira,pro}`.  With this
+   option, you need to specify a grammar (`--grammar`) or separate tune 
(`--tune-grammar`) and test
+   (`--test-grammar`) grammars.  A full grammar (`--grammar`) will be filtered 
against the relevant
+   tuning or test set unless you specify `--no-filter-tm`.  If you want a 
language model built from
+   the target side of your training data, you'll also need to pass in the 
training corpus
+   (`--corpus`).  You can also specify an arbitrary number of additional 
language models with one or
+   more `--lmfile` flags.
+
+- *TEST*: Testing.  If you have a tuned model file, you can test new corpora by passing in a test
+   corpus with references (`--test`).  You'll need to provide a run name (`--name`) to store the
+   results of this run, which will be placed under `test/NAME`.  You'll also need to provide a
+   Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a
+   grammar (`--grammar`).  The grammar will be filtered to the test data unless you specify
+   `--no-filter-tm` or directly provide a filtered test grammar (`--test-grammar`).
+
+- *LAST*: The last step.  This is the default target of `--last-step`.
+
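+The step names above form an ordered list, so selecting a range of steps can be sketched as
+below. This Python snippet is only an illustration of the idea; the function `steps_to_run` is
+hypothetical, and treating `--last-step` as inclusive is an assumption rather than documented
+behavior.
+
```python
STEPS = ["FIRST", "ALIGN", "PARSE", "THRAX", "TUNE", "TEST", "LAST"]


def steps_to_run(first_step="FIRST", last_step="LAST"):
    """Return the step names from first_step through last_step, inclusive."""
    # Step names are matched case-insensitively; unknown names raise ValueError.
    first = STEPS.index(first_step.upper())
    last = STEPS.index(last_step.upper())
    if first > last:
        raise ValueError(f"--first-step {first_step} comes after --last-step {last_step}")
    return STEPS[first:last + 1]
```
+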
+We now discuss these steps in more detail.
+
+### <a id="prep" /> 1. DATA PREPARATION
+
+Data preparation involves doing the following to each of the training data (`--corpus`), tuning data
+(`--tune`), and testing data (`--test`).  Each of these values is an absolute or relative path
+prefix.  To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and
+TARGET (`--target`), which are file extensions identifying the languages.  The SOURCE and TARGET
+files must have the same number of lines.  
+
+For tuning and test data, multiple references are handled automatically.  A 
single reference will
+have the format TUNE.TARGET, while multiple references will have the format 
TUNE.TARGET.NUM, where
+NUM starts at 0 and increments for as many references as there are.
+
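+The reference naming convention can be illustrated with a short Python sketch. It is not taken
+from the pipeline; the function name `reference_files` is made up, and preferring the
+single-reference file when both forms exist is an assumption of this example.
+
```python
import os


def reference_files(prefix, target):
    """Return the reference files for a tuning or test prefix, e.g. ("input/tune", "en")."""
    single = f"{prefix}.{target}"
    if os.path.isfile(single):
        return [single]
    refs = []
    num = 0
    # Multiple references are numbered consecutively, starting at 0.
    while os.path.isfile(f"{single}.{num}"):
        refs.append(f"{single}.{num}")
        num += 1
    return refs
```
+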
+The following processing steps are applied to each file.
+
+1.  **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
+    Multiple `--corpus` files are concatenated in the order they are specified.  Multiple `--tune`
+    and `--test` flags are not currently allowed.
+    
+1.  **Normalizing** punctuation and text (e.g., removing extra spaces, 
converting special
+    quotations).  There are a few language-specific options that depend on the 
file extension
+    matching the [two-letter ISO 
639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
+    designation.
+
+1.  **Tokenizing** the data (e.g., separating out punctuation, converting 
brackets).  Again, there
+    are language-specific tokenizations for a few languages (English, German, 
and Greek).
+
+1.  (Training only) **Removing** all parallel sentences with more than 
`--maxlen` tokens on either
+    side.  By default, MAXLEN is 50.  To turn this off, specify `--maxlen 0`.
+
+1.  **Lowercasing**.
+
+This creates a series of intermediate files which are saved for posterity but 
compressed.  For
+example, you might see
+
+    data/
+        train/
+            train.en.gz
+            train.tok.en.gz
+            train.tok.50.en.gz
+            train.tok.50.lc.en
+            corpus.en -> train.tok.50.lc.en
+
+The file "corpus.LANG" is a symbolic link to the last file in the chain.  
+
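+The training-only length filter from the list above can be sketched in Python as follows. This
+is an illustration, not the pipeline's own code: the function name `filter_by_length` is
+hypothetical, and tokens are assumed to be whitespace-separated because the filter runs after
+tokenization.
+
```python
def filter_by_length(pairs, maxlen=50):
    """Drop sentence pairs with more than `maxlen` tokens on either side."""
    if maxlen == 0:
        return list(pairs)  # a value of 0 turns the filter off
    kept = []
    for source, target in pairs:
        # Both sides must be within the limit for the pair to be kept.
        if len(source.split()) <= maxlen and len(target.split()) <= maxlen:
            kept.append((source, target))
    return kept
```
+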
+## 2. ALIGNMENT <a id="alignment" />
+
+Alignments are between the parallel corpora at `$RUNDIR/data/train/corpus.{SOURCE,TARGET}`.  To
+prevent the alignment tables from getting too big, the parallel corpora are split into blocks of no
+more than ALIGNER\_CHUNK\_SIZE sentence pairs (controlled with a parameter below).  The last block
+is folded into the penultimate block if it is too small.  These chunked files are all created in a
+subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, `corpus.LANG.1`, and so on.
+
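+The chunking rule can be sketched in Python as below. The text does not say how small the last
+block must be before it is folded into the previous one, so the `min_last` threshold (half a
+chunk by default) is purely an assumption of this example, as is the function name
+`chunk_bounds`.
+
```python
def chunk_bounds(num_pairs, chunk_size=1_000_000, min_last=None):
    """Return (start, end) line ranges, end exclusive, one per alignment block."""
    if min_last is None:
        min_last = chunk_size // 2  # assumed threshold; not taken from Joshua
    bounds = [(start, min(start + chunk_size, num_pairs))
              for start in range(0, num_pairs, chunk_size)]
    # Fold a too-small final block into the one before it.
    if len(bounds) > 1 and bounds[-1][1] - bounds[-1][0] < min_last:
        _, last_end = bounds.pop()
        prev_start, _ = bounds.pop()
        bounds.append((prev_start, last_end))
    return bounds
```
+
+With these assumptions, a corpus of 2,100,000 sentence pairs yields two blocks, the second of
+which absorbs the trailing 100,000 pairs.
+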
+The pipeline parameters affecting alignment are:
+
+-   `--aligner ALIGNER` {giza (default), berkeley, jacana}
+
+    Which aligner to use.  The default is 
[GIZA++](http://code.google.com/p/giza-pp/), but
+    [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be 
used instead.  When
+    using the Berkeley aligner, you'll want to pay attention to how much 
memory you allocate to it
+    with `--aligner-mem` (the default is 10g).
+
+-   `--aligner-chunk-size SIZE` (1,000,000)
+
+    The number of sentence pairs to compute alignments over. The training data 
is split into blocks
+    of this size, aligned separately, and then concatenated.
+    
+-   `--alignment FILE`
+
+    If you have an already-computed alignment, you can pass that to the script 
using this flag.
+    Note that, in this case, you will want to skip data preparation and 
alignment using
+    `--first-step thrax` (the first step after alignment) and also to specify 
`--no-prepare` so
+    as not to retokenize the data and mess with your alignments.
+    
+    The alignment file format is the standard format where 0-indexed many-many 
alignment pairs for a
+    sentence are provided on a line, source language first, e.g.,
+
+      0-0 0-1 1-2 1-7 ...
+
+    This value is required if you start at the grammar extraction step.
+
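+Reading the alignment format described above takes only a few lines. The following Python
+sketch is an illustration rather than code from Joshua; the function names `parse_alignment`
+and `in_bounds` are made up for the example.
+
```python
def parse_alignment(line):
    """Parse one line such as "0-0 0-1 1-2 1-7" into (source, target) index pairs."""
    pairs = []
    for token in line.split():
        source, target = token.split("-")
        pairs.append((int(source), int(target)))
    return pairs


def in_bounds(pairs, source_len, target_len):
    """Check that every 0-indexed alignment point falls inside both sentences."""
    return all(0 <= s < source_len and 0 <= t < target_len for s, t in pairs)
```
+
+A check like `in_bounds` is a cheap way to notice that an alignment file was computed on a
+different tokenization than the corpus it is paired with.
+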
+When alignment is complete, the alignment file can be found at 
`$RUNDIR/alignments/training.align`.
+It is parallel to the training corpora.  There are many files in the 
`alignments/` subdirectory that
+contain the output of intermediate steps.
+
+### <a id="parsing" /> 3. PARSING
+
+To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target side of the
+training data must be parsed. The pipeline assumes your target side will be English, and will parse
+it for you using [the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included.
+If your target-side language is not English, the target side of your training
+data (found at CORPUS.TARGET) must already be parsed in PTB format.  The pipeline will notice that
+it is parsed and will not reparse it.
+
+Parsing is affected by both the `--threads N` and `--jobs N` options.  The former runs the parser in
+multithreaded mode, while the latter distributes the runs across a cluster (and requires some
+configuration, not yet documented).  The options are mutually exclusive.
+
+Once the parsing is complete, there will be two parsed files:
+
+- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was 
parsed.
+- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of 
the above file used for
+  grammar extraction.
+
+## 4. THRAX (grammar extraction) <a id="tm" />
+
+The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
+the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the
+alignment file.  From these, it computes a synchronous context-free grammar.  If you already have a
+grammar and wish to skip this step, you can do so by passing the grammar with the
+`--grammar /path/to/grammar` flag.
+
+The main variable in grammar extraction is Hadoop.  If you have a Hadoop installation, simply ensure
+that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it.  If you *do
+not* have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
+standalone mode (this mode is triggered when `$HADOOP` is undefined).  Theoretically, any grammar
+extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
+enough; in practice, you probably are not patient enough, and will be limited to smaller
+datasets. You may also run into problems with disk space, since Hadoop uses a lot of it.  Use
+`--tmp /path/to/tmp` to specify an alternate place for temporary data; we suggest you use a local
+disk partition with tens or hundreds of gigabytes free, and not an NFS partition.  Setting up your
+own Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
+[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
+In our experience, this works fine, but you should note the following caveats:
+
+- It is of crucial importance that you have enough physical disks.  We have found that having too
+  few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues that
+  are hard to resolve, such as timeouts.  
+- NFS filesystems can cause lots of problems.  You should really try to install physical disks that
+  are dedicated to Hadoop scratch space.
+
+Here are some flags relevant to Hadoop and grammar extraction with Thrax:
+
+- `--hadoop /path/to/hadoop`
+
+  This sets the location of Hadoop (overriding the environment variable 
`$HADOOP`)
+  
+- `--hadoop-mem MEM` (2g)
+
+  This alters the amount of memory available to Hadoop mappers (passed via the
+  `mapred.child.java.opts` option).
+  
+- `--thrax-conf FILE`
+
+   Use the provided Thrax configuration file instead of the (grammar-specific) 
default.  The Thrax
+   templates are located at 
`$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one
+   of "hiero" or "samt".
+  
+When the grammar is extracted, it is compressed and placed at 
`$RUNDIR/grammar.gz`.
+
+## <a id="lm" /> 5. Language model
+
+Before tuning can take place, a language model is needed.  A language model is 
always built from the
+target side of the training corpus unless `--no-corpus-lm` is specified.  In 
addition, you can
+provide other language models (any number of them) with the `--lmfile FILE` 
argument.  Other
+arguments are as follows.
+
+-  `--lm` {kenlm (default), berkeleylm}
+
+   This determines the language model code that will be used when decoding.  
These implementations
+   are described in their respective papers (PDFs:
+   [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
+   
[BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). 
KenLM is written in
+   C++ and requires a pass through the JNI, but is recommended because it 
supports left-state minimization.
+   
+- `--lmfile FILE`
+
+  Specifies a pre-built language model to use when decoding.  This language model can be in ARPA
+  format, in KenLM format (when using KenLM), or in BerkeleyLM format (when using BerkeleyLM).
+
+- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, 
`--witten-bell`
+
+  At the tuning step, an LM is built from the target side of the training data (unless
+  `--no-corpus-lm` is specified).  This controls which code is used to build it.  The default is
+  KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), and is strongly recommended.
+  
+  If SRILM is used, it is called with the following arguments:
+  
+        $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz
+        
+  Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is 
passed to the pipeline.
+  
+  A [BerkeleyLM Java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
+  is also available. It computes a Kneser-Ney LM with constant discounting (0.75) and no count
+  thresholding.  The flag `--buildlm-mem` can be used to control how much memory is allocated to the
+  Java process.  The default is "2g", but you will want to increase it for larger language models.
+  
+  A language model built from the target side of the training data is placed 
at `$RUNDIR/lm.gz`.  
+
+## Interlude: decoder arguments
+
+The decoder is run in both the tuning stage and the testing stage.  A critical point is
+that you have to give the decoder enough memory to run.  Joshua can be very memory-intensive,
+particularly when decoding with large grammars and large language models.  The default amount of
+memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar).  You
+can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is a Java
+memory specification (passed to its `-Xmx` flag).
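+
+As a rough illustration of what a MEM value looks like, the sketch below checks a string against
+the usual JVM heap-size pattern (digits with an optional `k`, `m`, or `g` suffix) before it is
+passed to `--joshua-mem`. The helper `is_java_mem_spec` and the `JOSHUA_MEM` variable are invented
+for this example and are not part of Joshua.
+
```shell
#!/bin/bash
# Hypothetical helper (not part of Joshua): succeeds if $1 looks like a JVM
# heap size such as 3100m or 32g, i.e. digits plus an optional k/m/g suffix.
is_java_mem_spec() {
  [[ "$1" =~ ^[1-9][0-9]*[kKmMgG]?$ ]]
}

# JOSHUA_MEM is an assumed variable name used only in this sketch.
mem="${JOSHUA_MEM:-32g}"
if is_java_mem_spec "$mem"; then
  # Print the option as it would be appended to a pipeline.pl invocation.
  echo "--joshua-mem $mem"
else
  echo "not a valid memory specification: $mem" >&2
fi
```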
+
+## <a id="tuning" /> 6. TUNING
+
+Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`).  If Moses is
+installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner mira`, recommended).
+Tuning is run until convergence in the `$RUNDIR/tune` directory.
+
+When tuning is finished, the final configuration file can be found at
+
+    $RUNDIR/tune/joshua.config.final
+
+## <a id="testing" /> 7. TESTING
+
+For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.  If you
+like, you can also apply minimum Bayes-risk decoding to the decoder output with `--mbr`.  This
+usually yields a gain of about 0.3 to 0.5 BLEU points, but is time-consuming.
+
+After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
+writes it to `$RUNDIR/test/final-bleu`, and prints it. It also writes a file
+`$RUNDIR/test/final-times` containing a summary of runtime information. That's the end of the pipeline!
+
+Joshua also supports decoding further test sets.  This is enabled by rerunning the pipeline with
+the following arguments:
+
+-   `--first-step TEST`
+
+    This tells the pipeline to start at the test step.
+
+-   `--joshua-config CONFIG`
+
+    A tuned parameter file is required.  This file will be the output of some prior tuning run.
+    Necessary pathnames and so on will be adjusted.
+    
+## <a id="analysis" /> 8. ANALYSIS
+
+If you have used the suggested layout, with a number of related runs all contained in a common
+directory with sequential numbers, you can use the script `$JOSHUA/scripts/training/summarize.pl` to
+display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
+README file (using the pipeline's `--readme TEXT` flag).
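+
+If you only want the raw scores, something like the following sketch could approximate such a
+summary by printing each run's `test/final-bleu` file. It is not a reimplementation of
+`summarize.pl`: the function name is invented, the contents of `final-bleu` are printed verbatim
+because their exact format is not assumed, and the layout (numbered run directories under one
+parent) is the one described above.
+
```shell
#!/bin/bash
# Rough stand-in for a summary (not summarize.pl itself): for each run
# directory under $1 (default: current directory), print the directory name
# and the contents of its test/final-bleu file, sorted by run number.
print_bleu_summary() {
  local base="${1:-.}" f run
  for f in "$base"/*/test/final-bleu; do
    [ -f "$f" ] || continue                 # skip if the glob matched nothing
    run="$(basename "$(dirname "$(dirname "$f")")")"
    # Flatten the file to a single line; its exact format is not assumed.
    printf '%s\t%s\n' "$run" "$(tr '\n' ' ' < "$f" | sed 's/ *$//')"
  done | sort -n
}
```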
+
+## COMMON USE CASES AND PITFALLS 
+
+- If the pipeline dies at the "thrax-run" stage with an error like the following:
+
+      JOB FAILED (return code 1) 
+      hadoop/bin/hadoop: line 47: 
+      /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory 
+      Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell 
+      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell 
+      
+  This occurs if the `$HADOOP` environment variable is set but does not point to a working
+  Hadoop installation.  To fix it, make sure to unset the variable:
+  
+      # in bash
+      unset HADOOP
+      
+  and then rerun the pipeline with the same invocation.
+
+- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars.  In
+  particular, SAMT grammars often require a large amount of memory.  Many steps have been taken to
+  reduce memory usage, including beam settings and test-set- and sentence-level filtering of
+  grammars.  However, memory usage can still be in the tens of gigabytes.
+
+  To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
+  amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
+  nodes obtained by the qsub command.  These are set with the `--joshua-mem MEM` and
+  `--qsub-args ARGS` options.  For example,
+
+      pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
+
+  Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB
+  from the Hadoop server. If more memory is needed, set the memory requirement with the
+  `--hadoop-mem` option, in the same way as the `--joshua-mem` option is used.
+
+- Other pitfalls and advice will be added as they are discovered.
+
+## FEEDBACK 
+
+Please email [email protected] with problems or suggestions.
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/quick-start.md
----------------------------------------------------------------------
diff --git a/6.0/quick-start.md b/6.0/quick-start.md
new file mode 100644
index 0000000..53814ae
--- /dev/null
+++ b/6.0/quick-start.md
@@ -0,0 +1,59 @@
+---
+layout: default6
+title: Quick Start
+---
+
+If you just want to use Joshua to translate data, the quickest way is
+to download a [pre-built model](/language-packs/). 
+
+If no language pack is available, or if you have your own parallel
+data that you want to train the translation engine on, then you have
+to build your own model. This takes a bit more knowledge and effort,
+but is made easier with Joshua's [pipeline script](pipeline.html),
+which runs all the steps of preparing data, aligning it, and
+extracting and tuning component models. 
+
+Detailed information about running the pipeline can be found in
+[the pipeline documentation](/6.0/pipeline.html), but as a quick
+start, you can build a simple Bengali--English model by following
+these instructions.
+
+*NOTE: We suggest you build models outside the `$JOSHUA` directory*.
+
+First, download the dataset:
+   
+    mkdir -p ~/models/bn-en/
+    cd ~/models/bn-en
+    wget -q -O indian-parallel-corpora-1.0.tar.gz \
+        https://github.com/joshua-decoder/indian-parallel-corpora/archive/1.0.tar.gz
+    tar xzf indian-parallel-corpora-1.0.tar.gz
+    ln -s indian-parallel-corpora-1.0 input
+
+Then, train and test a model:
+
+    $JOSHUA/bin/pipeline.pl --source bn --target en \
+        --type hiero \
+        --no-prepare --aligner berkeley \
+        --corpus input/bn-en/tok/training.bn-en \
+        --tune input/bn-en/tok/dev.bn-en \
+        --test input/bn-en/tok/devtest.bn-en
+
+This will align the data with the Berkeley aligner, build a Hiero
+model, tune with MERT, decode the test sets, and report results that
+should correspond with what you find on
+[the Indian Parallel Corpora page](/indian-parallel-corpora/). For
+more details, including information on the many options available with
+the pipeline script, please see [its documentation page](pipeline.html).
+
+Finally, you can export the full model as a language pack:
+
+    ./run-bundler.py \
+      tune/joshua.config.final \
+      language-pack-bn-en \
+      --pack-tm grammar.gz
+      
+(or possibly `tune/1/joshua.config.final` if you're using an older version of
+the pipeline).
+
+This will create a [runnable model](bundle.html) in
+`language-pack-bn-en`. See the `README` file in that directory for
+information on how to run the decoder.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/server.md
----------------------------------------------------------------------
diff --git a/6.0/server.md b/6.0/server.md
new file mode 100644
index 0000000..f3d8da5
--- /dev/null
+++ b/6.0/server.md
@@ -0,0 +1,30 @@
+---
+layout: default6
+category: links
+title: Server mode
+---
+
+The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs.
+
+Threading takes place both within and across requests.  Threads from the decoder pool are assigned in round-robin manner across requests, preventing starvation.
+
+
+# Invoking the server
+
+The server is configured at invocation time. To start in server mode, run `joshua-decoder` with the option `-server-port [PORT]`. Additionally, the server can be configured in the same ways as the command-line tool.
+
+E.g.,
+
+    $JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10
+
+## Using the server
+
+To test that the server is working, you can send it a set of inputs from the command line.
+
+The server, as configured in the example above, will respond to requests on port 10101.  You can test it out with the `nc` utility:
+
+    wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101
+
+Since no model was loaded, this will just return the text to you as sent to the server.
+
+The `-server-port` option can also be used when creating a [bundled configuration](bundle.html) that will be run in server mode.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/thrax.md
----------------------------------------------------------------------
diff --git a/6.0/thrax.md b/6.0/thrax.md
new file mode 100644
index 0000000..dbcc71c
--- /dev/null
+++ b/6.0/thrax.md
@@ -0,0 +1,14 @@
+---
+layout: default6
+category: advanced
+title: Grammar extraction with Thrax
+---
+
+One day, this will hold Thrax documentation, including how to use Thrax, how 
to do grammar
+filtering, and details on the configuration file options.  It will also 
include details about our
+experience setting up and maintaining Hadoop cluster installations, knowledge 
wrought of hard-fought
+sweat and tears.
+
+In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if 
there is something you
+need to do that you don't understand.  You might also be able to dig up some 
information [on the old
+Thrax page](http://cs.jhu.edu/~jonny/thrax/).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/tms.md
----------------------------------------------------------------------
diff --git a/6.0/tms.md b/6.0/tms.md
new file mode 100644
index 0000000..7ce5e9d
--- /dev/null
+++ b/6.0/tms.md
@@ -0,0 +1,106 @@
+---
+layout: default6
+category: advanced
+title: Building Translation Models
+---
+
+# Build a translation model
+
+Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences.
+
+We will copy (or symlink) the parallel source text files into a subdirectory called `input/`.
+
+Then, we concatenate all the training files on each side. The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step align`.
+
+To tokenize the English data, run the following. Note that `tee` is used to save the concatenated text to `all.en` while still passing it down the pipe; redirecting with `>` before the pipe would leave the later stages with no input.
+
+    cat callhome.en europarl.en fisher.en | tee all.en \
+      | $JOSHUA/scripts/training/normalize-punctuation.pl en \
+      | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl \
+      | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en
+
+The same can be done for the Spanish side of the input data:
+
+    cat callhome.es europarl.es fisher.es | tee all.es \
+      | $JOSHUA/scripts/training/normalize-punctuation.pl es \
+      | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl \
+      | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es
+
+By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project.
+
+The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line.
+
+    paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \
+      | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en
+
+Contents of `splittabs.pl`, by Matt Post:
+
+    #!/usr/bin/perl
+
+    # splits on tab, printing respective chunks to the list of files given
+    # as script arguments
+
+    use FileHandle;
+
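+    $| = 1;   # don't buffer output
+
+    # at least one output file is required
+    if (@ARGV < 1) {
+      print "Usage: splittabs.pl file1 [file2 ...] < tabbed-file\n";
+      exit;
+    }
+
+    # one output filehandle per file named on the command line
+    my @fh = map { get_filehandle($_) } @ARGV;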
+    @ARGV = ();
+
+    while (my $line = <>) {
+      chomp($line);
+      my (@fields) = split(/\t/,$line,scalar @fh);
+
+      map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields);
+    }
+
+    sub get_filehandle {
+        my $file = shift;
+
+        if ($file eq "-") {
+            return *STDOUT;
+        } else {
+            local *FH;
+            open FH, ">$file" or die "can't open '$file' for writing";
+            return *FH;
+        }
+    }
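+
+Since `grep -P` is not available in every `grep` implementation, a portable alternative to the
+`grep -Pv` plus `splittabs.pl` combination above can be sketched with `paste` and `awk` alone.
+The function name `filter_blank_pairs` is invented for this example, and the sketch assumes that
+the input lines themselves contain no tab characters.
+
```shell
#!/bin/bash
# Sketch: drop sentence pairs where either side is an empty line, writing the
# surviving lines to two new files. Assumes the inputs contain no tabs.
# Usage: filter_blank_pairs SRC_IN TGT_IN SRC_OUT TGT_OUT
filter_blank_pairs() {
  # Create/truncate the outputs so they exist even if no pair survives.
  : > "$3"
  : > "$4"
  # paste joins line N of both files with a tab; awk keeps a pair only when
  # both fields are non-empty and writes each field to its own file.
  paste "$1" "$2" |
    awk -F'\t' -v src="$3" -v tgt="$4" \
      '$1 != "" && $2 != "" { print $1 > src; print $2 > tgt }'
}
```
+
+With the file names used above, the call would be:
+
+    filter_blank_pairs all.norm.tok.lc.es all.norm.tok.lc.en \
+      all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en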
+
+Now we can run the pipeline to extract the grammar. Run the following script:
+
+    #!/bin/bash
+
+    # this creates a grammar
+
+    # NEED:
+    # pair
+    # type
+
+    set -u
+
+    pair=es-en
+    type=hiero
+
+    #. ~/.bashrc
+
+    #basedir=$(pwd)
+
+    dir=grammar-$pair-$type
+
+    [[ ! -d $dir ]] && mkdir -p $dir
+    cd $dir
+
+    source=$(echo $pair | cut -d- -f 1)
+    target=$(echo $pair | cut -d- -f 2)
+
+    $JOSHUA/scripts/training/pipeline.pl \
+      --source $source \
+      --target $target \
+      --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \
+      --type $type \
+      --joshua-mem 100g \
+      --no-prepare \
+      --first-step align \
+      --last-step thrax \
+      --hadoop $HADOOP \
+      --threads 8

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/tutorial.md
----------------------------------------------------------------------
diff --git a/6.0/tutorial.md b/6.0/tutorial.md
new file mode 100644
index 0000000..482162f
--- /dev/null
+++ b/6.0/tutorial.md
@@ -0,0 +1,187 @@
+---
+layout: default6
+category: links
+title: Pipeline tutorial
+---
+
+This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a
+sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of
+other options available in the pipeline.
+
+## Download and Setup
+
+Download and install Joshua as described on the [quick start page](index.html), installing it under
+`~/code/`. Once you've done that, you should make sure you have the following environment variables set:
+
+    export JOSHUA=$HOME/code/joshua-v{{ site.data.joshua.release_version }}
+    export JAVA_HOME=/usr/java/default
+
+If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it. For example, if the `hadoop` command is in `/usr/bin`,
+you should type
+
+    export HADOOP=/usr
+
+Joshua will find the binary and use it to submit to your Hadoop cluster. If you don't have one, just
+make sure that `$HADOOP` is unset, and Joshua will roll one out for you and run it in
+[standalone mode](https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html). 
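+
+Before starting a long run, it can help to check which of these two cases applies. The sketch
+below is a hypothetical pre-flight check, not something shipped with Joshua: the function name
+`hadoop_status` is invented, it assumes the binary lives at `$HADOOP/bin/hadoop` (as in the
+`/usr` example above), and it treats an empty `$HADOOP` the same as an unset one.
+
```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of Joshua). Prints one of:
#   unset   - HADOOP is not set (or empty)
#   ok      - $HADOOP/bin/hadoop exists and is executable
#   broken  - HADOOP is set, but no executable is found at $HADOOP/bin/hadoop
hadoop_status() {
  if [ -z "${HADOOP:-}" ]; then
    echo unset
  elif [ -x "$HADOOP/bin/hadoop" ]; then
    echo ok
  else
    echo broken
  fi
}

hadoop_status
```
+
+A result of `broken` suggests either fixing the path or running `unset HADOOP` before invoking the pipeline.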
+
+## A basic pipeline run
+
+For today's experiments, we'll be building a Spanish--English system using data included in the
+[Fisher and CALLHOME translation corpus](/data/fisher-callhome-corpus/). This
+data was collected by translating transcribed speech from previous LDC releases.
+
+Download the data and install it somewhere:
+
+    mkdir -p ~/data
+    cd ~/data
+    wget --no-check-certificate -O fisher-callhome-corpus.zip \
+        https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
+    unzip fisher-callhome-corpus.zip
+
+Then define the environment variable `$FISHER` to point to it:
+
+    cd ~/data/fisher-callhome-corpus-master
+    export FISHER=$(pwd)
+    
+### Preparing the data
+
+Inside the archive is the Fisher and CALLHOME Spanish--English data, which includes Kaldi-provided
+ASR output and English translations of the Fisher and CALLHOME dataset transcriptions. Because of
+licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site
+license, a script is provided to build them. You can type:
+
+    ./bin/build_fisher.sh /export/common/data/corpora/LDC/LDC2010T04
+
+where the first argument is the path to your LDC data release. This will create the files in `corpus/ldc`.
+
+In `$FISHER/corpus`, there is a set of parallel directories for LDC transcripts (`ldc`), ASR output
+(`asr`), oracle ASR output (`oracle`), and ASR lattice output (`plf`). The files look like this:
+
+    $ ls corpus/ldc
+    callhome_devtest.en  fisher_dev2.en.2  fisher_dev.en.2   fisher_test.en.2
+    callhome_evltest.en  fisher_dev2.en.3  fisher_dev.en.3   fisher_test.en.3
+    callhome_train.en    fisher_dev2.es    fisher_dev.es     fisher_test.es
+    fisher_dev2.en.0     fisher_dev.en.0   fisher_test.en.0  fisher_train.en
+    fisher_dev2.en.1     fisher_dev.en.1   fisher_test.en.1  fisher_train.es
+
+If you don't have the LDC transcripts, you can use the data in `corpus/asr` instead. We will now use
+this data to build our own Spanish--English model using Joshua's pipeline.
+    
+### Run the pipeline
+
+Create an experiments directory to hold your first experiment. *Note: it's important that
+this **not** be inside your `$JOSHUA` directory*.
+
+    mkdir -p ~/expts/joshua
+    cd ~/expts/joshua
+    
+We will now create the baseline run, using a particular directory structure for experiments that
+will allow us to take advantage of scripts provided with Joshua for displaying the results of many
+related experiments. Because this can take quite some time to run, we are going to reduce the size
+of the model considerably by restricting the training data: Joshua will only use sentences in the
+training sets with ten or fewer words on either side (Spanish or English):
+
+    cd ~/expts/joshua
+    $JOSHUA/bin/pipeline.pl           \
+      --rundir 1                      \
+      --readme "Baseline Hiero run"   \
+      --source es                     \
+      --target en                     \
+      --type hiero                    \
+      --corpus $FISHER/corpus/ldc/fisher_train \
+      --tune $FISHER/corpus/ldc/fisher_dev \
+      --test $FISHER/corpus/ldc/fisher_dev2 \
+      --maxlen 10 \
+      --lm-order 3
+      
+This will start the pipeline building a Spanish--English translation system constructed from the
+training data, tuned against `fisher_dev`, and tested against `fisher_dev2`. It will use the
+default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment,
+KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state
+minimization for representing LM state in the decoder, and so on. We change the order of the n-gram
+model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.
+
+A few notes:
+
+- This will likely take many hours to run, especially if you don't have a Hadoop cluster.
+
+- If you are running on Mac OS X, KenLM's `lmplz` will not build due to the absence of static
+  libraries. In that case, you should add the flag `--lm-gen srilm` (recommended, if SRILM is
+  installed) or `--lm-gen berkeleylm`.
+
+### Variations
+
+Once that is finished, you will have a baseline model. From there, you might wish to try variations
+of the baseline model. Here are some examples of what you could vary:
+
+- Build an SAMT model (`--type samt`), GHKM model (`--type ghkm`), or phrasal ITG model (`--type phrasal`) 
+   
+- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`)
+   
+- Build the language model with BerkeleyLM (`--lm-gen berkeleylm`) instead of KenLM (the default)
+
+- Change the order of the LM from the default of 5 (`--lm-order 4`)
+
+- Tune with MIRA instead of MERT (`--tuner mira`). This requires that Moses is installed.
+   
+- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100)
+
+- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $FISHER/bn-en/dict.bn-en`)
+
+To do this, we will create new runs that partially reuse the results of previous runs. This is
+possible by doing three things: (1) incrementing the run directory and providing an updated README
+note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3)
+providing the needed dependencies.
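+
+For the first of these, picking the next run number by hand works fine, but it can also be
+scripted. The helper below is a hypothetical sketch (the name `next_rundir` is invented and it is
+not part of Joshua); it assumes that runs live in subdirectories named `1`, `2`, `3`, and so on,
+and ignores any directory whose name is not purely numeric.
+
```shell
#!/bin/bash
# Hypothetical helper: print the next unused run number in directory $1
# (default: current directory), assuming runs are subdirectories named 1, 2, 3, ...
next_rundir() {
  local base="${1:-.}" max=0 d n
  for d in "$base"/*/; do
    n="$(basename "$d")"
    # Only consider purely numeric names; 10# forces base-10 for names like 08.
    if [[ "$n" =~ ^[0-9]+$ ]] && (( 10#$n > max )); then
      max=$((10#$n))
    fi
  done
  echo $((max + 1))
}
```
+
+The result could then be passed along as `--rundir "$(next_rundir)"`.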
+
+## A second run
+
+Let's begin by changing the tuner, to see what effect that has. To do so, we change the run
+directory, tell the pipeline to start at the tuning step, and provide the needed dependencies:
+
+    $JOSHUA/bin/pipeline.pl           \
+      --rundir 2                      \
+      --readme "Tuning with MIRA"     \
+      --source es                     \
+      --target en                     \
+      --corpus $FISHER/corpus/ldc/fisher_train \
+      --tune $FISHER/corpus/ldc/fisher_dev \
+      --test $FISHER/corpus/ldc/fisher_dev2 \
+      --first-step tune \
+      --tuner mira \
+      --grammar 1/grammar.gz \
+      --no-corpus-lm \
+      --lmfile 1/lm.gz
+      
+Here, we have essentially the same invocation, but we have told the pipeline to use a different
+tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs
+to execute the tuning step. 
+
+Note that we have also told it not to build a language model. This is necessary because the
+pipeline always builds an LM on the target side of the training data, if provided, but we are
+supplying the language model that was already built. We could equivalently have removed the
+`--corpus` line.
+ 
+## Changing the model type
+
+Let's compare the Hiero model we've already built to an SAMT model. We have to re-extract the
+grammar, but can reuse the alignments and the language model:
+
+    $JOSHUA/bin/pipeline.pl           \
+      --rundir 3                      \
+      --readme "Baseline SAMT model"  \
+      --source es                     \
+      --target en                     \
+      --type samt                     \
+      --corpus $FISHER/corpus/ldc/fisher_train \
+      --tune $FISHER/corpus/ldc/fisher_dev \
+      --test $FISHER/corpus/ldc/fisher_dev2 \
+      --alignment 1/alignments/training.align   \
+      --first-step parse \
+      --no-corpus-lm \
+      --lmfile 1/lm.gz
+
+See [the pipeline script page](pipeline.html#steps) for a list of all the steps.
+
+## Analyzing the results
+
+We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them
+using the `$JOSHUA/scripts/training/summarize.pl` script.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/6.0/whats-new.md
----------------------------------------------------------------------
diff --git a/6.0/whats-new.md b/6.0/whats-new.md
new file mode 100644
index 0000000..c145fd5
--- /dev/null
+++ b/6.0/whats-new.md
@@ -0,0 +1,12 @@
+---
+layout: default6
+title: What's New
+---
+
+Joshua 6.0 introduces a number of new features and improvements.
+
+- A new phrase-based decoder that is as fast as Moses
+- Significantly faster hierarchical decoding
+- Support for class-based language modeling
+- Reflection-based loading of feature functions for super-easy
+  development of new features
