[31/44] incubator-joshua-site git commit: First attempt

mjpost Fri, 08 Apr 2016 20:11:01 -0700

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/6.0/pipeline.html
----------------------------------------------------------------------
diff --git a/6.0/pipeline.html b/6.0/pipeline.html
new file mode 100644
index 0000000..f10f3fa
--- /dev/null
+++ b/6.0/pipeline.html
@@ -0,0 +1,966 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="description" content="">
+    <meta name="author" content="">
+    <link rel="icon" href="../../favicon.ico">
+
+    <title>Joshua Documentation | The Joshua Pipeline</title>
+
+    <!-- Bootstrap core CSS -->
+    <link href="/dist/css/bootstrap.min.css" rel="stylesheet">
+
+    <!-- Custom styles for this template -->
+    <link href="/joshua6.css" rel="stylesheet">
+  </head>
+
+  <body>
+
+    <div class="blog-masthead">
+      <div class="container">
+        <nav class="blog-nav">
+          <!-- <a class="blog-nav-item active" href="#">Joshua</a> -->
+          <a class="blog-nav-item" href="/">Joshua</a>
+          <!-- <a class="blog-nav-item" href="/6.0/whats-new.html">New 
features</a> -->
+          <a class="blog-nav-item" href="/language-packs/">Language packs</a>
+          <a class="blog-nav-item" href="/data/">Datasets</a>
+          <a class="blog-nav-item" href="/support/">Support</a>
+          <a class="blog-nav-item" href="/contributors.html">Contributors</a>
+        </nav>
+      </div>
+    </div>
+
+    <div class="container">
+
+      <div class="row">
+
+        <div class="col-sm-2">
+          <div class="sidebar-module">
+            <!-- <h4>About</h4> -->
+            <center>
+            <img src="/images/joshua-logo-small.png" />
+            <p>Joshua machine translation toolkit</p>
+            </center>
+          </div>
+          <hr>
+          <center>
+            <a href="/releases/current/" target="_blank"><button 
class="button">Download Joshua 6.0.5</button></a>
+            <br />
+            <a href="/releases/runtime/" target="_blank"><button 
class="button">Runtime only version</button></a>
+            <p>Released November 5, 2015</p>
+          </center>
+          <hr>
+          <!-- <div class="sidebar-module"> -->
+          <!--   <span id="download"> -->
+          <!--     <a href="http://joshua-decoder.org/downloads/joshua-6.0.tgz">Download</a> -->
+          <!--   </span> -->
+          <!-- </div> -->
+          <div class="sidebar-module">
+            <h4>Using Joshua</h4>
+            <ol class="list-unstyled">
+              <li><a href="/6.0/install.html">Installation</a></li>
+              <li><a href="/6.0/quick-start.html">Quick Start</a></li>
+            </ol>
+          </div>
+          <hr>
+          <div class="sidebar-module">
+            <h4>Building new models</h4>
+            <ol class="list-unstyled">
+              <li><a href="/6.0/pipeline.html">Pipeline</a></li>
+              <li><a href="/6.0/tutorial.html">Tutorial</a></li>
+              <li><a href="/6.0/faq.html">FAQ</a></li>
+            </ol>
+          </div>
+<!--
+          <div class="sidebar-module">
+            <h4>Phrase-based</h4>
+            <ol class="list-unstyled">
+              <li><a href="/6.0/phrase.html">Training</a></li>
+            </ol>
+          </div>
+-->
+          <hr>
+          <div class="sidebar-module">
+            <h4>Advanced</h4>
+            <ol class="list-unstyled">
+              <li><a href="/6.0/bundle.html">Building language packs</a></li>
+              <li><a href="/6.0/decoder.html">Decoder options</a></li>
+              <li><a href="/6.0/file-formats.html">File formats</a></li>
+              <li><a href="/6.0/packing.html">Packing TMs</a></li>
+              <li><a href="/6.0/large-lms.html">Building large LMs</a></li>
+            </ol>
+          </div>
+
+          <hr> 
+          <div class="sidebar-module">
+            <h4>Developer</h4>
+            <ol class="list-unstyled">              
+               <li><a href="https://github.com/joshua-decoder/joshua">Github</a></li>
+               <li><a href="http://cs.jhu.edu/~post/joshua-docs">Javadoc</a></li>
+               <li><a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_developers">Mailing list</a></li>
+            </ol>
+          </div>
+
+        </div><!-- /.blog-sidebar -->
+
+        
+        <div class="col-sm-8 blog-main">
+        
+
+          <div class="blog-title">
+            <h2>The Joshua Pipeline</h2>
+          </div>
+          
+          <div class="blog-post">
+
+            <p><em>Please note that Joshua 6.0.3 included some big changes to the directory organization of the
+ pipeline’s files.</em></p>
+
+<p>This page describes the Joshua pipeline script, which manages the 
complexity of training and
+evaluating machine translation systems.  The pipeline eases the pain of two 
related tasks in
+statistical machine translation (SMT) research:</p>
+
+<ul>
+  <li>
+    <p>Training SMT systems involves a complicated process of interacting 
steps that are
+time-consuming and prone to failure.</p>
+  </li>
+  <li>
+    <p>Developing and testing new techniques requires varying parameters at different points in the
+pipeline, and earlier results (which are often expensive to compute) should not have to be recomputed each time.</p>
+  </li>
+</ul>
+
+<p>To facilitate these tasks, the pipeline script:</p>
+
+<ul>
+  <li>
+    <p>Runs the complete SMT pipeline, from corpus normalization and 
tokenization, through alignment,
+model building, tuning, test-set decoding, and evaluation.</p>
+  </li>
+  <li>
+    <p>Caches the results of intermediate steps (using robust SHA-1 checksums 
on dependencies), so the
+pipeline can be debugged or shared across similar runs while doing away with 
time spent
+recomputing expensive steps.</p>
+  </li>
+  <li>
+    <p>Allows you to jump into and out of the pipeline at a set of predefined 
places (e.g., the alignment
+stage), so long as you provide the missing dependencies.</p>
+  </li>
+</ul>
+
+<p>The Joshua pipeline script is designed in the spirit of Moses’ <code class="highlighter-rouge">train-model.perl</code>, and shares
+(and has borrowed) many of its features.  It is not as extensive as Moses’
+<a href="http://www.statmt.org/moses/?n=FactoredTraining.EMS">Experiment Management System</a>, which allows
+the user to define arbitrary execution dependency graphs. However, it is significantly simpler to
+use, allowing many systems to be built with a single command (that may run for days or weeks).</p>
+
+<h2 id="dependencies">Dependencies</h2>
+
+<p>The pipeline has no <em>required</em> external dependencies.  However, it 
has support for a number of
+external packages, some of which are included with Joshua.</p>
+
+<ul>
+  <li>
+    <p><a href="http://code.google.com/p/giza-pp/">GIZA++</a> (included)</p>
+
+    <p>GIZA++ is the default aligner.  It is included with Joshua, and should compile successfully when
+you type <code class="highlighter-rouge">ant</code> from the Joshua root directory.  It is not required because you can use the
+(included) Berkeley aligner (<code class="highlighter-rouge">--aligner berkeley</code>). We have recently also provided support
+for the <a href="http://code.google.com/p/jacana-xy/wiki/JacanaXY">Jacana-XY aligner</a> (<code class="highlighter-rouge">--aligner jacana</code>). </p>
+  </li>
+  <li>
+    <p><a href="http://hadoop.apache.org/">Hadoop</a> (included)</p>
+
+    <p>The pipeline uses the <a href="thrax.html">Thrax grammar extractor</a>, 
which is built on Hadoop.  If you
+have a Hadoop installation, simply ensure that the <code 
class="highlighter-rouge">$HADOOP</code> environment variable is defined, and
+the pipeline will use it automatically at the grammar extraction step.  If you 
are going to
+attempt to extract very large grammars, it is best to have a good-sized Hadoop 
installation.</p>
+
+    <p>(If you do not have a Hadoop installation, you might consider setting one up.  Hadoop can be
+installed in a
+<a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed">“pseudo-distributed”</a>
+mode that allows it to use just a few machines or a number of processors on a single machine.
+The main issue is to ensure that there are a lot of independent physical disks, since in our
+experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on
+the disks.)</p>
+
+    <p>If you don’t have a Hadoop installation, there is no need to worry.  The pipeline will unroll a
+standalone installation and use it to extract your grammar.  This behavior will be triggered if
+<code class="highlighter-rouge">$HADOOP</code> is undefined.</p>
+  </li>
+  <li>
+    <p><a href="http://statmt.org/moses/">Moses</a> (not included). Moses is needed
+if you wish to use its “kbmira” tuner (<code class="highlighter-rouge">--tuner kbmira</code>), or if you
+wish to build phrase-based models.</p>
+  </li>
+  <li>
+    <p><a href="http://www.speech.sri.com/projects/srilm/">SRILM</a> (not included; not needed; not recommended)</p>
+
+    <p>By default, the pipeline uses the included <a href="https://kheafield.com/code/kenlm/">KenLM</a> for
+building (and also querying) language models. Joshua also includes a Java program from the
+<a href="http://code.google.com/p/berkeleylm/">Berkeley LM</a> package that contains code for constructing a
+Kneser-Ney-smoothed language model in ARPA format from the target side of your training data.<br />
+There is no need to use SRILM, but if you do wish to use it, you need to do the following:</p>
+
+    <ol>
+      <li>Install SRILM and set the <code 
class="highlighter-rouge">$SRILM</code> environment variable to point to its 
installed location.</li>
+      <li>Add the <code class="highlighter-rouge">--lm-gen srilm</code> flag 
to your pipeline invocation.</li>
+    </ol>
+
+    <p>More information on this is available in the <a href="#lm">LM building section of the pipeline</a>.  SRILM
+is not used for representing language models during decoding (and in fact is not supported,
+having been supplanted by <a href="http://kheafield.com/code/kenlm/">KenLM</a> (the default) and
+BerkeleyLM).</p>
+  </li>
+</ul>
+
+<p>After installing any dependencies, follow the brief instructions on
+the <a href="install.html">installation page</a>, and then you are ready to 
build
+models. </p>
+
+<h2 id="a-basic-pipeline-run">A basic pipeline run</h2>
+
+<p>The pipeline takes a set of inputs (training, tuning, and test data), and 
creates a set of
+intermediate files in the <em>run directory</em>.  By default, the run 
directory is the current directory,
+but it can be changed with the <code class="highlighter-rouge">--rundir</code> 
parameter.</p>
+
+<p>For this quick start, we will be working with the example that can be found in
+<code class="highlighter-rouge">$JOSHUA/examples/training</code>.  This example contains 1,000 sentences of Urdu-English data (the full
+dataset is available as part of the
+<a href="/indian-parallel-corpora/">Indian languages parallel corpora</a>), with
+100-sentence tuning and test sets with four references each.</p>
+
+<p>Running the pipeline requires two main steps: data preparation and 
invocation.</p>
+
+<ol>
+  <li>
+    <p>Prepare your data.  The pipeline script needs to be told where to find the raw training, tuning,
+and test data.  A good convention is to place these files in an <code class="highlighter-rouge">input/</code> subdirectory of your run’s
+working directory (NOTE: do not use <code class="highlighter-rouge">data/</code>, since a directory of that name is created and used
+by the pipeline itself for storing processed files).  The expected format (for each of training,
+tuning, and test) is a pair of files that share a common path prefix and are distinguished by
+their extension, e.g.,</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code>input/
+      train.SOURCE
+      train.TARGET
+      tune.SOURCE
+      tune.TARGET
+      test.SOURCE
+      test.TARGET
+</code></pre>
+    </div>
+
+    <p>These files should be parallel at the sentence level (with one sentence per line), should be in
+UTF-8, and should be untokenized (tokenization occurs in the pipeline).  SOURCE and TARGET denote
+variables that should be replaced with the actual source and target language abbreviations (e.g.,
+“ur” and “en”).</p>
+  </li>
+  <li>
+    <p>Run the pipeline.  The following is the minimal invocation to run the 
complete pipeline:</p>
+
+    <div class="highlighter-rouge"><pre 
class="highlight"><code>$JOSHUA/bin/pipeline.pl  \
+  --rundir .             \
+  --type hiero           \
+  --corpus input/train   \
+  --tune input/tune      \
+  --test input/test      \
+  --source SOURCE        \
+  --target TARGET
+</code></pre>
+    </div>
+
+    <p>The <code class="highlighter-rouge">--corpus</code>, <code class="highlighter-rouge">--tune</code>, and <code class="highlighter-rouge">--test</code> flags define file prefixes that are concatenated with the
+language extensions given by <code class="highlighter-rouge">--source</code> and <code class="highlighter-rouge">--target</code> (with a “.” in between).  Note the
+correspondences with the files defined in the first step above.  The prefixes can be either
+absolute or relative pathnames.  For example, with SOURCE set to “ur” and TARGET set to “en”, this invocation assumes that a subdirectory <code class="highlighter-rouge">input/</code>
+exists in the current directory, that you are translating from a language identified by the “ur”
+extension to a language identified by the “en” extension, that the training data can be found at
+<code class="highlighter-rouge">input/train.en</code> and <code class="highlighter-rouge">input/train.ur</code>, and so on.</p>
+  </li>
+</ol>
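+
+<p>Because the input files must be parallel at the sentence level, it can be worth checking their line counts before starting a long run.  The following is only a minimal sketch using standard Unix tools, not part of the pipeline itself; it assumes the <code class="highlighter-rouge">input/</code> layout from step 1 and the “ur” and “en” extensions used in this example:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: compare the line counts of the two training files (paths are assumptions)
+src=$(wc -l &lt; input/train.ur)
+tgt=$(wc -l &lt; input/train.en)
+if [ $src -ne $tgt ]; then
+  echo "train.ur has $src lines but train.en has $tgt lines"
+fi
+</code></pre>
+</div>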
+
+<p><em>Don’t</em> run the pipeline directly from <code class="highlighter-rouge">$JOSHUA</code>, or, for that matter, in any directory with lots of other files.
+This can cause problems because the pipeline creates lots of files under <code class="highlighter-rouge">--rundir</code> that can clobber existing files.
+You should run experiments in a clean directory.
+For example, if you have Joshua installed in <code class="highlighter-rouge">$HOME/code/joshua</code>, manage your runs in a different location, such as <code class="highlighter-rouge">$HOME/expts/joshua</code>.</p>
+
+<p>Assuming no problems arise, this command will run the complete pipeline in 
about 20 minutes,
+producing BLEU scores at the end.  As it runs, you will see output that looks 
like the following:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] 
rebuilding...
+  dep=/Users/post/code/joshua/test/pipeline/input/train.en 
+  dep=data/train/train.en.gz [NOT FOUND]
+  cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n &gt; 
data/train/train.en.gz
+  took 0 seconds (0s)
+[train-copy-ur] rebuilding...
+  dep=/Users/post/code/joshua/test/pipeline/input/train.ur 
+  dep=data/train/train.ur.gz [NOT FOUND]
+  cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n &gt; 
data/train/train.ur.gz
+  took 0 seconds (0s)
+...
+</code></pre>
+</div>
+
+<p>And in the current directory, you will see the following files (among
+other files, including intermediate files
+generated by the individual sub-steps).</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>data/
+    train/
+        corpus.ur
+        corpus.en
+        thrax-input-file
+    tune/
+        corpus.ur -&gt; tune.tok.lc.ur
+        corpus.en -&gt; tune.tok.lc.en
+        grammar.filtered.gz
+        grammar.glue
+    test/
+        corpus.ur -&gt; test.tok.lc.ur
+        corpus.en -&gt; test.tok.lc.en
+        grammar.filtered.gz
+        grammar.glue
+alignments/
+    0/
+        [giza/berkeley aligner output files]
+    1/
+    ...
+    training.align
+thrax-hiero.conf
+thrax.log
+grammar.gz
+lm.gz
+tune/
+     decoder_command
+     model/
+           [model files]
+     params.txt
+     joshua.log
+     mert.log
+     joshua.config.final
+     final-bleu
+test/
+     model/
+           [model files]
+     output
+     final-bleu
+</code></pre>
+</div>
+
+<p>These files will be described in more detail in subsequent sections of this 
tutorial.</p>
+
+<p>Another useful flag is the <code class="highlighter-rouge">--rundir 
DIR</code> flag, which chdir()s to the specified directory before
+running the pipeline.  By default the rundir is the current directory.  
Changing it can be useful
+for organizing related pipeline runs.  In fact, we highly recommend
+that you organize your runs using consecutive integers, also taking a
+minute to pass a short note with the <code 
class="highlighter-rouge">--readme</code> flag, which allows you
+to quickly generate reports on <a href="#managing">groups of related 
experiments</a>.
+Relative paths specified to other flags (e.g., to <code 
class="highlighter-rouge">--corpus</code>
+or <code class="highlighter-rouge">--lmfile</code>) are relative to the 
directory the pipeline was called <em>from</em>, not the rundir itself
+(unless they happen to be the same, of course).</p>
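+
+<p>As a rough illustration of this numbering scheme, a first run might be launched as follows.  This is a sketch only: the note text is arbitrary, the language extensions are those of the Urdu-English example, and the <code class="highlighter-rouge">input/</code> paths are resolved relative to the directory the command is issued from, as described above.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: run number 1 of a group, with a short note describing it
+$JOSHUA/bin/pipeline.pl            \
+  --rundir 1                       \
+  --readme "Baseline Hiero system" \
+  --type hiero                     \
+  --corpus input/train             \
+  --tune input/tune                \
+  --test input/test                \
+  --source ur                      \
+  --target en
+</code></pre>
+</div>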
+
+<p>The complete pipeline comprises many tens of small steps, which can be 
grouped together into a set
+of traditional pipeline tasks:</p>
+
+<ol>
+  <li><a href="#prep">Data preparation</a></li>
+  <li><a href="#alignment">Alignment</a></li>
+  <li><a href="#parsing">Parsing</a> (syntax-based grammars only)</li>
+  <li><a href="#tm">Grammar extraction</a></li>
+  <li><a href="#lm">Language model building</a></li>
+  <li><a href="#tuning">Tuning</a></li>
+  <li><a href="#testing">Testing</a></li>
+  <li><a href="#analysis">Analysis</a></li>
+</ol>
+
+<p>These steps are discussed below, after a few intervening sections about 
high-level details of the
+pipeline.</p>
+
+<h2 id="a-idmanaging--managing-groups-of-experiments"><a id="managing"></a> 
Managing groups of experiments</h2>
+
+<p>The real utility of the pipeline comes when you use it to manage groups of 
experiments. Typically,
+there is a held-out test set, and we want to vary a number of training 
parameters to determine what
+effect this has on BLEU scores or some other metric. Joshua comes with a script
+<code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> 
that collects information from a group of runs and reports
+them to you. This script works so long as you organize your runs as 
follows:</p>
+
+<ol>
+  <li>
+    <p>Your runs should be grouped together in a root directory, which I’ll call <code class="highlighter-rouge">$EXPDIR</code>.</p>
+  </li>
+  <li>
+    <p>For comparison purposes, the runs should all be evaluated on the same 
test set.</p>
+  </li>
+  <li>
+    <p>Each run in the run group should be in its own numbered directory, 
shown with the files used by
+the summarize script:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code>$EXPDIR/
+    1/
+        README.txt
+        test/
+            final-bleu
+            final-times
+        [other files]
+    2/
+        README.txt
+        test/
+            final-bleu
+            final-times
+        [other files]
+        ...
+</code></pre>
+    </div>
+  </li>
+</ol>
+
+<p>You can get such directories using the <code 
class="highlighter-rouge">--rundir N</code> flag to the pipeline. </p>
+
+<p>Run directories can build off each other. For example, <code class="highlighter-rouge">1/</code> might contain a complete baseline
+run. If you wanted to just change the tuner, you don’t need to rerun the aligner and model builder,
+so you can reuse the results by supplying the second run with the information it needs that was
+computed in run 1:</p>
+
+<div class="highlighter-rouge"><pre 
class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+  --first-step tune \
+  --grammar 1/grammar.gz \
+  ...
+</code></pre>
+</div>
+
+<p>More details are below.</p>
+
+<h2 id="grammar-options">Grammar options</h2>
+
+<p>Hierarchical Joshua can extract three types of grammars: Hiero
+grammars, GHKM, and SAMT grammars.  As described on the
+<a href="file-formats.html">file formats page</a>, all of them are encoded into
+the same file format, but they differ in terms of the richness of
+their nonterminal sets.</p>
+
+<p>Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from
+word-based alignments and then subtracting out phrase differences.  More detail can be found in
+<a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201">Chiang (2007) [PDF]</a>.
+<a href="http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf">GHKM</a> (new with 5.0) and
+<a href="http://www.cs.cmu.edu/~zollmann/samt/">SAMT</a> grammars make use of a source- or target-side parse
+tree on the training data, differing in the way they extract rules using these trees: GHKM extracts
+synchronous tree substitution grammar rules rooted in a subset of the tree constituents, whereas
+SAMT projects constituent labels down onto phrases.  SAMT grammars are usually many times larger and
+are much slower to decode with, but sometimes increase BLEU score.  Hiero and SAMT grammars are both
+extracted with the <a href="thrax.html">Thrax software</a>.</p>
+
+<p>By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the <code class="highlighter-rouge">--type
+(ghkm|samt)</code> flag. For GHKM grammars, the default is to use
+<a href="http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz">Michel Galley’s extractor</a>,
+but you can also use Moses’ extractor with <code class="highlighter-rouge">--ghkm-extractor moses</code>. Galley’s extractor only outputs
+two features, so the scores tend to be significantly lower than those of Moses’.</p>
+
+<p>Joshua (new in version 6) also includes an unlexicalized phrase-based
+decoder. Building a phrase-based model requires you to have Moses
+installed, since its <code class="highlighter-rouge">train-model.perl</code> 
script is used to extract the
+phrase table. You can enable this by defining the <code 
class="highlighter-rouge">$MOSES</code> environment
+variable and then specifying <code class="highlighter-rouge">--type 
phrase</code>.</p>
+
+<h2 id="other-high-level-options">Other high-level options</h2>
+
+<p>The following command-line arguments control run-time behavior of multiple 
steps:</p>
+
+<ul>
+  <li>
+    <p><code class="highlighter-rouge">--threads N</code> (1)</p>
+
+    <p>This enables multithreaded operation for a number of steps: alignment (with GIZA, max two
+threads), parsing, and decoding (any number of threads).</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--jobs N</code> (1)</p>
+
+    <p>This enables parallel operation over a cluster using the qsub command.  This feature is not
+well-documented at this point, but you will likely want to edit the file
+<code class="highlighter-rouge">$JOSHUA/scripts/training/parallelize/LocalConfig.pm</code> to set up your qsub environment, and may also
+want to pass specific qsub arguments via the <code class="highlighter-rouge">--qsub-args "ARGS"</code>
+flag. We suggest you stick to the standard Joshua model that
+tries to use as many cores as are available with the <code class="highlighter-rouge">--threads N</code> option.</p>
+  </li>
+</ul>
+
+<h2 id="restarting-failed-runs">Restarting failed runs</h2>
+
+<p>If the pipeline dies, you can restart it with the same command you used the first time.  If you
+rerun the pipeline with the exact same invocation as the previous run (or an overlapping
+configuration, that is, one that causes the same set of behaviors), you will see slightly different
+output compared to what we saw above:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>[train-copy-en] 
cached, skipping...
+[train-copy-ur] cached, skipping...
+...
+</code></pre>
+</div>
+
+<p>This indicates that the caching module has discovered that the step was 
already computed and thus
+did not need to be rerun.  This feature is quite useful for restarting 
pipeline runs that have
+crashed due to bugs, memory limitations, hardware failures, and the myriad 
other problems that
+plague MT researchers across the world.</p>
+
+<p>Often, a command will die because it was parameterized incorrectly.  For example, perhaps the
+decoder ran out of memory.  Caching allows you to adjust the offending parameter (e.g., <code class="highlighter-rouge">--joshua-mem</code>) and rerun
+the script without repeating the steps that already succeeded.  Of course, if you change one of the parameters a step depends on, it will trigger a
+rerun, which in turn might trigger further downstream reruns.</p>
+
+<h2 id="a-idsteps--skipping-steps-quitting-early"><a id="steps"></a> Skipping 
steps, quitting early</h2>
+
+<p>You will also find it useful to start the pipeline somewhere other than data preparation (for
+example, if you have already-processed data and an alignment, and want to begin with building a
+grammar) or to end it prematurely (if, say, you don’t have a test set and just want to tune a
+model).  This can be accomplished with the <code class="highlighter-rouge">--first-step</code> and <code class="highlighter-rouge">--last-step</code> flags, each of which takes as
+its argument one of the following step names (case-insensitive):</p>
+
+<ul>
+  <li>
+    <p><em>FIRST</em>: Data preparation.  Everything begins with data 
preparation.  This is the default first
+ step, so there is no need to be explicit about it.</p>
+  </li>
+  <li>
+    <p><em>ALIGN</em>: Alignment.  You might want to start here if you want to 
skip data preprocessing.</p>
+  </li>
+  <li>
+    <p><em>PARSE</em>: Parsing.  This is only relevant for building SAMT 
grammars (<code class="highlighter-rouge">--type samt</code>), in which case
+ the target side (<code class="highlighter-rouge">--target</code>) of the 
training data (<code class="highlighter-rouge">--corpus</code>) is parsed 
before building a
+ grammar.</p>
+  </li>
+  <li>
+    <p><em>THRAX</em>: Grammar extraction <a href="thrax.html">with Thrax</a>.  If you jump to this step, you’ll need to
+ provide an aligned corpus (<code class="highlighter-rouge">--alignment</code>) along with your parallel data.  </p>
+  </li>
+  <li>
+    <p><em>TUNE</em>: Tuning.  The exact tuning method is determined with <code class="highlighter-rouge">--tuner {mert,mira,pro}</code>.  If you start at this
+ step, you need to specify a grammar (<code class="highlighter-rouge">--grammar</code>) or separate tune (<code class="highlighter-rouge">--tune-grammar</code>) and test
+ (<code class="highlighter-rouge">--test-grammar</code>) grammars.  A full grammar (<code class="highlighter-rouge">--grammar</code>) will be filtered against the relevant
+ tuning or test set unless you specify <code class="highlighter-rouge">--no-filter-tm</code>.  If you want a language model built from
+ the target side of your training data, you’ll also need to pass in the training corpus
+ (<code class="highlighter-rouge">--corpus</code>).  You can also specify an arbitrary number of additional language models with one or
+ more <code class="highlighter-rouge">--lmfile</code> flags.</p>
+  </li>
+  <li>
+    <p><em>TEST</em>: Testing.  If you have a tuned model file, you can test new corpora by passing in a test
+ corpus with references (<code class="highlighter-rouge">--test</code>).  You’ll need to provide a run name (<code class="highlighter-rouge">--name</code>) to store the
+ results of this run, which will be placed under <code class="highlighter-rouge">test/NAME</code>.  You’ll also need to provide a
+ Joshua configuration file (<code class="highlighter-rouge">--joshua-config</code>), one or more language models (<code class="highlighter-rouge">--lmfile</code>), and a
+ grammar (<code class="highlighter-rouge">--grammar</code>).  The grammar will be filtered to the test data unless you specify
+ <code class="highlighter-rouge">--no-filter-tm</code> or directly provide a filtered test grammar (<code class="highlighter-rouge">--test-grammar</code>).</p>
+  </li>
+  <li>
+    <p><em>LAST</em>: The last step.  This is the default target of <code 
class="highlighter-rouge">--last-step</code>.</p>
+  </li>
+</ul>
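+
+<p>To make the TEST requirements concrete, here is a hedged sketch of decoding a new test set with an existing model.  The run name <code class="highlighter-rouge">newtest</code> and the prefix <code class="highlighter-rouge">input/newtest</code> are hypothetical, and the model files are assumed to be the ones produced by an earlier complete run in the same directory (see the file listing in the basic pipeline run above); passing <code class="highlighter-rouge">--source</code> and <code class="highlighter-rouge">--target</code> here is likewise an assumption carried over from the basic invocation.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># Sketch: test-only run; file names are assumptions based on the earlier directory listing
+$JOSHUA/bin/pipeline.pl                    \
+  --first-step test                        \
+  --name newtest                           \
+  --test input/newtest                     \
+  --source ur                              \
+  --target en                              \
+  --joshua-config tune/joshua.config.final \
+  --lmfile lm.gz                           \
+  --grammar grammar.gz
+</code></pre>
+</div>
+
+<p>If the arguments are accepted, the results should appear under <code class="highlighter-rouge">test/newtest</code>.</p>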
+
+<p>We now discuss these steps in more detail.</p>
+
+<h3 id="a-idprep--1-data-preparation"><a id="prep"></a> 1. DATA 
PREPARATION</h3>
+
+<p>Data preparation applies a series of processing steps to each of the training data (<code class="highlighter-rouge">--corpus</code>), tuning data
+(<code class="highlighter-rouge">--tune</code>), and testing data (<code class="highlighter-rouge">--test</code>).  Each of these values is an absolute or relative path
+prefix.  To each of these prefixes, a “.” is appended, followed by each of SOURCE (<code class="highlighter-rouge">--source</code>) and
+TARGET (<code class="highlighter-rouge">--target</code>), which are file extensions identifying the languages.  The SOURCE and TARGET
+files must have the same number of lines.  </p>
+
+<p>For tuning and test data, multiple references are handled automatically.  A 
single reference will
+have the format TUNE.TARGET, while multiple references will have the format 
TUNE.TARGET.NUM, where
+NUM starts at 0 and increments for as many references as there are.</p>
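+
+<p>For instance, with <code class="highlighter-rouge">--tune input/tune</code>, “ur” as the source, “en” as the target, and four references (as in the example data set), the naming scheme above implies a layout along these lines (an illustration of the convention, not a listing copied from the distribution):</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>input/
+      tune.ur
+      tune.en.0
+      tune.en.1
+      tune.en.2
+      tune.en.3
+</code></pre>
+</div>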
+
+<p>The following processing steps are applied to each file.</p>
+
+<ol>
+  <li>
+    <p><strong>Copying</strong> the files into <code class="highlighter-rouge">$RUNDIR/data/TYPE</code>, where TYPE is one of “train”, “tune”, or “test”.
+Multiple <code class="highlighter-rouge">--corpus</code> files are concatenated in the order they are specified.  Multiple <code class="highlighter-rouge">--tune</code>
+and <code class="highlighter-rouge">--test</code> flags are not currently allowed.</p>
+  </li>
+  <li>
+    <p><strong>Normalizing</strong> punctuation and text (e.g., removing extra spaces, converting special
+quotations).  There are a few language-specific options that depend on the file extension
+matching the <a href="http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">two-letter ISO 639-1</a>
+designation.</p>
+  </li>
+  <li>
+    <p><strong>Tokenizing</strong> the data (e.g., separating out punctuation, 
converting brackets).  Again, there
+are language-specific tokenizations for a few languages (English, German, and 
Greek).</p>
+  </li>
+  <li>
+    <p>(Training only) <strong>Removing</strong> all parallel sentences with 
more than <code class="highlighter-rouge">--maxlen</code> tokens on either
+side.  By default, MAXLEN is 50.  To turn this off, specify <code 
class="highlighter-rouge">--maxlen 0</code>.</p>
+  </li>
+  <li>
+    <p><strong>Lowercasing</strong>.</p>
+  </li>
+</ol>
+
+<p>This creates a series of intermediate files which are saved for posterity 
but compressed.  For
+example, you might see</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>data/
+    train/
+        train.en.gz
+        train.tok.en.gz
+        train.tok.50.en.gz
+        train.tok.50.lc.en
+        corpus.en -&gt; train.tok.50.lc.en
+</code></pre>
+</div>
+
+<p>The file “corpus.LANG” is a symbolic link to the last file in the chain.  </p>
+
+<h2 id="alignment-a-idalignment-">2. ALIGNMENT <a id="alignment"></a></h2>
+
+<p>Alignments are computed between the parallel corpora at <code class="highlighter-rouge">$RUNDIR/data/train/corpus.{SOURCE,TARGET}</code>.  To
+prevent the alignment tables from getting too big, the parallel corpora are split into blocks of no
+more than ALIGNER_CHUNK_SIZE sentence pairs (controlled with a parameter below).  The last block is folded
+into the penultimate block if it is too small.  These chunked files are all created in a
+subdirectory of <code class="highlighter-rouge">$RUNDIR/data/train/splits</code>, named <code class="highlighter-rouge">corpus.LANG.0</code>, <code class="highlighter-rouge">corpus.LANG.1</code>, and so on.</p>
+
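+<p>The chunking described above can be sketched as follows. This is an illustration only, not the pipeline's implementation: the text does not say what counts as “too small”, so the <code class="highlighter-rouge">min_last</code> threshold (defaulting to half the chunk size) is an assumption, as are the function and parameter names.</p>

```python
def chunk_corpus(num_pairs, chunk_size=1_000_000, min_last=None):
    """Return (start, end) ranges, end exclusive, covering num_pairs sentence pairs.

    min_last is a hypothetical threshold: if the final block holds fewer
    pairs than this, it is folded into the block before it.
    """
    if min_last is None:
        min_last = chunk_size // 2  # assumption, not Joshua's documented rule
    bounds = [
        (start, min(start + chunk_size, num_pairs))
        for start in range(0, num_pairs, chunk_size)
    ]
    if len(bounds) > 1 and bounds[-1][1] - bounds[-1][0] < min_last:
        _, last_end = bounds.pop()
        prev_start, _ = bounds.pop()
        bounds.append((prev_start, last_end))
    return bounds


if __name__ == "__main__":
    # 25 pairs in blocks of 10: the trailing block of 5 is merged into the second.
    print(chunk_corpus(25, chunk_size=10, min_last=6))
```

+<p>Each resulting range would correspond to one pair of files such as <code class="highlighter-rouge">corpus.LANG.0</code> and <code class="highlighter-rouge">corpus.LANG.1</code>.</p>
+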
+<p>The pipeline parameters affecting alignment are:</p>
+
+<ul>
+  <li>
+    <p><code class="highlighter-rouge">--aligner ALIGNER</code> {giza 
(default), berkeley, jacana}</p>
+
+    <p>Which aligner to use.  The default is <a href="http://code.google.com/p/giza-pp/">GIZA++</a>, but
+<a href="http://code.google.com/p/berkeleyaligner/">the Berkeley aligner</a> can be used instead.  When
+using the Berkeley aligner, you’ll want to pay attention to how much memory you allocate to it
+with <code class="highlighter-rouge">--aligner-mem</code> (the default is 10g).</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--aligner-chunk-size SIZE</code> 
(1,000,000)</p>
+
+    <p>The number of sentence pairs to compute alignments over. The training 
data is split into blocks
+of this size, aligned separately, and then concatenated.</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--alignment FILE</code></p>
+
+    <p>If you have an already-computed alignment, you can pass that to the 
script using this flag.
+Note that, in this case, you will want to skip data preparation and alignment 
using
+<code class="highlighter-rouge">--first-step thrax</code> (the first step 
after alignment) and also to specify <code 
class="highlighter-rouge">--no-prepare</code> so
+as not to retokenize the data and mess with your alignments.</p>
+
+    <p>The alignment file format is the standard format where 0-indexed 
many-many alignment pairs for a
+sentence are provided on a line, source language first, e.g.,</p>
+
+    <p>0-0 0-1 1-2 1-7 …</p>
+
+    <p>This value is required if you start at the grammar extraction step.</p>
+  </li>
+</ul>
+
+<p>When alignment is complete, the alignment file can be found at <code 
class="highlighter-rouge">$RUNDIR/alignments/training.align</code>.
+It is parallel to the training corpora.  There are many files in the <code 
class="highlighter-rouge">alignments/</code> subdirectory that
+contain the output of intermediate steps.</p>
+
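+<p>To make the alignment file format described above concrete, the following Python sketch parses one line of 0-indexed, source-first alignment points and checks that they fit a sentence pair. The function names are hypothetical, and the bounds check is an addition for illustration rather than something the pipeline is documented to do.</p>

```python
def parse_alignment_line(line):
    """Parse '0-0 0-1 1-2' into [(0, 0), (0, 1), (1, 2)], source index first."""
    points = []
    for token in line.split():
        src, tgt = token.split("-")
        points.append((int(src), int(tgt)))
    return points


def check_alignment(points, source_len, target_len):
    """Return True if every 0-indexed point falls inside both sentences."""
    return all(0 <= s < source_len and 0 <= t < target_len for s, t in points)


if __name__ == "__main__":
    points = parse_alignment_line("0-0 0-1 1-2 1-7")
    print(points)
    print(check_alignment(points, source_len=2, target_len=8))
```

+<p>A check like this can be a quick way to spot an alignment file that no longer matches the corpus, for instance after the data has been retokenized.</p>
+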
+<h3 id="a-idparsing--3-parsing"><a id="parsing"></a> 3. PARSING</h3>
+
+<p>To build SAMT and GHKM grammars (<code class="highlighter-rouge">--type 
samt</code> and <code class="highlighter-rouge">--type ghkm</code>), the target 
side of the
+training data must be parsed. The pipeline assumes your target side will be English, and will parse
+it for you using <a href="http://code.google.com/p/berkeleyparser/">the Berkeley parser</a>, which is included.
+If your target language is not English, the target side of your training
+data (found at CORPUS.TARGET) must already be parsed in PTB format.  The pipeline will notice that
+it is parsed and will not reparse it.</p>
+
+<p>Parsing is affected by both the <code class="highlighter-rouge">--threads N</code> and <code class="highlighter-rouge">--jobs N</code> options.  The former runs the parser in
+multithreaded mode, while the latter distributes the runs across a cluster (and requires some
+configuration, not yet documented).  The options are mutually exclusive.</p>
+
+<p>Once the parsing is complete, there will be two parsed files:</p>
+
+<ul>
+  <li><code 
class="highlighter-rouge">$RUNDIR/data/train/corpus.en.parsed</code>: this is 
the mixed-case file that was parsed.</li>
+  <li><code 
class="highlighter-rouge">$RUNDIR/data/train/corpus.parsed.en</code>: this is a 
leaf-lowercased version of the above file used for
+grammar extraction.</li>
+</ul>
+
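+<p>“Leaf-lowercased” means that only the words of the parse tree are lowercased, while the nonterminal labels keep their case. A minimal Python sketch of that idea follows. It is not the script the pipeline uses, and it assumes each word appears as <code class="highlighter-rouge">(TAG word)</code> with a single space between tag and word.</p>

```python
import re

# A preterminal and its word, e.g. "(NN Cat)": a tag, one space, then a token.
_LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def lowercase_leaves(tree):
    """Lowercase the words of a PTB-style tree, leaving the labels untouched."""
    return _LEAF.sub(lambda m: f"({m.group(1)} {m.group(2).lower()})", tree)


if __name__ == "__main__":
    print(lowercase_leaves("(S (NP (DT The) (NN Cat)) (VP (VBZ Sleeps)))"))
```

+<p>Note that this sketch would also lowercase escaped bracket tokens such as <code class="highlighter-rouge">-LRB-</code>; whether that is desirable depends on how the rest of the corpus was processed.</p>
+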
+<h2 id="thrax-grammar-extraction-a-idtm-">4. THRAX (grammar extraction) <a 
id="tm"></a></h2>
+
+<p>The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2)
+the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3) the
+alignment file.  From these, it computes a synchronous context-free grammar.  If you already have a
+grammar and wish to skip this step, you can do so by passing the grammar with the <code class="highlighter-rouge">--grammar
+/path/to/grammar</code> flag.</p>
+
+<p>The main variable in grammar extraction is Hadoop.  If you have a Hadoop installation, simply ensure
+that the environment variable <code class="highlighter-rouge">$HADOOP</code> is defined, and Thrax will seamlessly use it.  If you <em>do
+not</em> have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
+standalone mode (this mode is triggered when <code class="highlighter-rouge">$HADOOP</code> is undefined).  Theoretically, any grammar
+extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
+enough; in practice, you probably are not patient enough, and will be limited to smaller
+datasets. You may also run into problems with disk space; Hadoop uses a lot (use <code class="highlighter-rouge">--tmp
+/path/to/tmp</code> to specify an alternate place for temporary data; we suggest you use a local disk
+partition with tens or hundreds of gigabytes free, and not an NFS partition).  Setting up your own
+Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a
+<a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html">pseudo-distributed version of Hadoop</a>.
+In our experience, this works fine, but you should note the following caveats:</p>
+
+<ul>
+  <li>It is of crucial importance that you have enough physical disks.  We have found that having too
+few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues that are hard to
+resolve, such as timeouts.  </li>
+  <li>NFS filesystems can cause lots of problems.  You should really try to 
install physical disks that
+are dedicated to Hadoop scratch space.</li>
+</ul>
+
+<p>Here are some flags relevant to Hadoop and grammar extraction with 
Thrax:</p>
+
+<ul>
+  <li>
+    <p><code class="highlighter-rouge">--hadoop /path/to/hadoop</code></p>
+
+    <p>This sets the location of Hadoop (overriding the environment variable 
<code class="highlighter-rouge">$HADOOP</code>)</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--hadoop-mem MEM</code> (2g)</p>
+
+    <p>This alters the amount of memory available to Hadoop mappers (passed 
via the
+<code class="highlighter-rouge">mapred.child.java.opts</code> options).</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--thrax-conf FILE</code></p>
+
+    <p>Use the provided Thrax configuration file instead of the (grammar-specific) default.  The Thrax
+ templates are located at <code class="highlighter-rouge">$JOSHUA/scripts/training/templates/thrax-TYPE.conf</code>, where TYPE is one
+ of “hiero” or “samt”.</p>
+  </li>
+</ul>
+
+<p>When the grammar is extracted, it is compressed and placed at <code 
class="highlighter-rouge">$RUNDIR/grammar.gz</code>.</p>
+
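+<p>The way the Hadoop location is chosen, as described in this section, can be summarized in a few lines of Python. This is only a sketch of the documented behavior (the <code class="highlighter-rouge">--hadoop</code> flag overrides <code class="highlighter-rouge">$HADOOP</code>, and an undefined variable triggers standalone mode); the function name is hypothetical, and treating an empty variable as undefined is an assumption.</p>

```python
import os


def resolve_hadoop(cli_path=None, environ=None):
    """Return the Hadoop location to use, or None to signal standalone mode.

    cli_path stands in for the value given to --hadoop, if any.
    """
    environ = os.environ if environ is None else environ
    if cli_path:
        return cli_path  # the flag overrides the environment variable
    # Assumption: an empty $HADOOP is treated the same as an undefined one.
    return environ.get("HADOOP") or None


if __name__ == "__main__":
    print(resolve_hadoop("/opt/hadoop", {"HADOOP": "/usr/lib/hadoop"}))
    print(resolve_hadoop(None, {}))
```

+<p>This also explains the failure described under the common pitfalls below: a <code class="highlighter-rouge">$HADOOP</code> value that is set but stale is still used, so standalone mode is never triggered.</p>
+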
+<h2 id="a-idlm--5-language-model"><a id="lm"></a> 5. Language model</h2>
+
+<p>Before tuning can take place, a language model is needed.  A language model 
is always built from the
+target side of the training corpus unless <code 
class="highlighter-rouge">--no-corpus-lm</code> is specified.  In addition, you 
can
+provide other language models (any number of them) with the <code 
class="highlighter-rouge">--lmfile FILE</code> argument.  Other
+arguments are as follows.</p>
+
+<ul>
+  <li>
+    <p><code class="highlighter-rouge">--lm</code> {kenlm (default), 
berkeleylm}</p>
+
+    <p>This determines the language model code that will be used when decoding.  These implementations
+are described in their respective papers (PDFs:
+<a href="http://kheafield.com/professional/avenue/kenlm.pdf">KenLM</a>,
+<a href="http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf">BerkeleyLM</a>). KenLM is written in
+C++ and requires a pass through the JNI, but is recommended because it supports left-state minimization.</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--lmfile FILE</code></p>
+
+    <p>Specifies a pre-built language model to use when decoding.  This language model can be in ARPA
+format, in KenLM format when using KenLM, or in BerkeleyLM format when using BerkeleyLM.</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--lm-gen</code> {kenlm (default), 
srilm, berkeleylm}, <code class="highlighter-rouge">--buildlm-mem MEM</code>, 
<code class="highlighter-rouge">--witten-bell</code></p>
+
+    <p>At the tuning step, an LM is built from the target side of the training data (unless
+<code class="highlighter-rouge">--no-corpus-lm</code> is specified).  This controls which code is used to build it.  The default is
+KenLM’s <a href="http://kheafield.com/code/kenlm/estimation/">lmplz</a>, which is strongly recommended.</p>
+
+    <p>If SRILM is used, it is called with the following arguments:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code>  
$SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text 
TRAINING-DATA -unk -lm lm.gz
+</code></pre>
+    </div>
+
+    <p>Where SMOOTHING is <code class="highlighter-rouge">-kndiscount</code>, 
or <code class="highlighter-rouge">-wbdiscount</code> if <code 
class="highlighter-rouge">--witten-bell</code> is passed to the pipeline.</p>
+
+    <p>A <a href="http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java">BerkeleyLM Java class</a>
+is also available. It computes a Kneser-Ney LM with constant discounting (0.75) and no count
+thresholding.  The flag <code class="highlighter-rouge">--buildlm-mem</code> can be used to control how much memory is allocated to the
+Java process.  The default is “2g”, but you will want to increase it for larger language models.</p>
+
+    <p>A language model built from the target side of the training data is 
placed at <code class="highlighter-rouge">$RUNDIR/lm.gz</code>.  </p>
+  </li>
+</ul>
+
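+<p>For readers who want to reproduce the SRILM call outside the pipeline, the command line quoted above can be assembled as in the following Python sketch. It only builds the argument list from the template given in this section; the function name and its parameters are hypothetical, and nothing is executed.</p>

```python
def srilm_command(srilm_root, training_data, witten_bell=False, output="lm.gz"):
    """Build the ngram-count argument list described in the text.

    witten_bell mirrors the pipeline's --witten-bell flag: it swaps
    Kneser-Ney discounting for Witten-Bell discounting.
    """
    smoothing = "-wbdiscount" if witten_bell else "-kndiscount"
    return [
        f"{srilm_root}/bin/i686-m64/ngram-count",
        "-interpolate", smoothing,
        "-order", "5",
        "-text", training_data,
        "-unk",
        "-lm", output,
    ]


if __name__ == "__main__":
    print(" ".join(srilm_command("/opt/srilm", "corpus.en", witten_bell=True)))
```

+<p>Keeping the arguments in a list rather than a single string avoids quoting problems if the list is later handed to a process launcher.</p>
+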
+<h2 id="interlude-decoder-arguments">Interlude: decoder arguments</h2>
+
+<p>The decoder is run in both the tuning stage and the testing stage.  A critical point is
+that you have to give the decoder enough memory to run.  Joshua can be very memory-intensive, in
+particular when decoding with large grammars and large language models.  The default amount of
+memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar).  You
+can alter the amount of memory for Joshua using the <code class="highlighter-rouge">--joshua-mem MEM</code> argument, where MEM is a Java
+memory specification (passed to its <code class="highlighter-rouge">-Xmx</code> flag).</p>
+
+<h2 id="a-idtuning--6-tuning"><a id="tuning"></a> 6. TUNING</h2>
+
+<p>Two optimizers are provided with Joshua: MERT and PRO (<code class="highlighter-rouge">--tuner {mert,pro}</code>).  If Moses is
+installed, you can also use Cherry &amp; Foster’s k-best batch MIRA (<code class="highlighter-rouge">--tuner mira</code>, recommended).
+Tuning is run until convergence in the <code class="highlighter-rouge">$RUNDIR/tune</code> directory.</p>
+
+<p>When tuning is finished, the final configuration file can be found at</p>
+
+<div class="highlighter-rouge"><pre 
class="highlight"><code>$RUNDIR/tune/joshua.config.final
+</code></pre>
+</div>
+
+<h2 id="a-idtesting--7-testing"><a id="testing"></a> 7. Testing</h2>
+
+<p>For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.  If you
+like, you can also apply minimum Bayes-risk decoding to the decoder output with <code class="highlighter-rouge">--mbr</code>.  This
+usually yields a gain of about 0.3 to 0.5 BLEU points, but is time-consuming.</p>
+
+<p>After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
+writes it to <code class="highlighter-rouge">$RUNDIR/test/final-bleu</code>, and prints it. It also writes a file
+<code class="highlighter-rouge">$RUNDIR/test/final-times</code> containing a summary of runtime information. That’s the end of the pipeline!</p>
+
+<p>Joshua also supports decoding further test sets.  This is enabled by 
rerunning the pipeline with a
+number of arguments:</p>
+
+<ul>
+  <li>
+    <p><code class="highlighter-rouge">--first-step TEST</code></p>
+
+    <p>This tells the pipeline to start at the test step.</p>
+  </li>
+  <li>
+    <p><code class="highlighter-rouge">--joshua-config CONFIG</code></p>
+
+    <p>A tuned parameter file is required.  This file will be the output of 
some prior tuning run.
+Necessary pathnames and so on will be adjusted.</p>
+  </li>
+</ul>
+
+<h2 id="a-idanalysis-8-analysis"><a id="analysis"> 8. ANALYSIS</a></h2>
+
+<p>If you have used the suggested layout, with a number of related runs all contained in a common
+directory with sequential numbers, you can use the script <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> to
+display a summary of the mean BLEU scores from all runs, along with the text you placed in the run
+README file (using the pipeline’s <code class="highlighter-rouge">--readme TEXT</code> flag).</p>
+
+<h2 id="common-use-cases-and-pitfalls">COMMON USE CASES AND PITFALLS</h2>
+
+<ul>
+  <li>
+    <p>If the pipeline dies at the “thrax-run” stage with an error like the following:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code>JOB FAILED 
(return code 1) 
+hadoop/bin/hadoop: line 47: 
+/some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or 
directory 
+Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/hadoop/fs/FsShell 
+Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell 
+</code></pre>
+    </div>
+
+    <p>This occurs if the <code class="highlighter-rouge">$HADOOP</code> 
environment variable is set but does not point to a working
+Hadoop installation.  To fix it, make sure to unset the variable:</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code># in bash
+unset HADOOP
+</code></pre>
+    </div>
+
+    <p>and then rerun the pipeline with the same invocation.</p>
+  </li>
+  <li>
+    <p>Memory usage is a major consideration in decoding with Joshua and 
hierarchical grammars.  In
+particular, SAMT grammars often require a large amount of memory.  Many steps 
have been taken to
+reduce memory usage, including beam settings and test-set- and sentence-level 
filtering of
+grammars.  However, memory usage can still be in the tens of gigabytes.</p>
+
+    <p>To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
+amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
+nodes obtained by the qsub command.  These are set with the <code class="highlighter-rouge">--joshua-mem</code> MEM and
+<code class="highlighter-rouge">--qsub-args</code> ARGS options.  For example,</p>
+
+    <div class="highlighter-rouge"><pre class="highlight"><code>pipeline.pl 
--joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
+</code></pre>
+    </div>
+
+    <p>Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2 GB
+from the Hadoop server. If more memory is needed, set the memory requirement with the
+<code class="highlighter-rouge">--hadoop-mem</code> option, in the same way as the <code class="highlighter-rouge">--joshua-mem</code> option is used.</p>
+  </li>
+  <li>
+    <p>Other pitfalls and advice will be added as they are discovered.</p>
+  </li>
+</ul>
+
+<h2 id="feedback">FEEDBACK</h2>
+
+<p>Please email [email protected] with problems or 
suggestions.</p>
+
+
+
+          <!--   <h4 class="blog-post-title">Welcome to Joshua!</h4> -->
+
+          <!--   <p>This blog post shows a few different types of content 
that's supported and styled with Bootstrap. Basic typography, images, and code 
are all supported.</p> -->
+          <!--   <hr> -->
+          <!--   <p>Cum sociis natoque penatibus et magnis <a href="#">dis 
parturient montes</a>, nascetur ridiculus mus. Aenean eu leo quam. Pellentesque 
ornare sem lacinia quam venenatis vestibulum. Sed posuere consectetur est at 
lobortis. Cras mattis consectetur purus sit amet fermentum.</p> -->
+          <!--   <blockquote> -->
+          <!--     <p>Curabitur blandit tempus porttitor. <strong>Nullam quis 
risus eget urna mollis</strong> ornare vel eu leo. Nullam id dolor id nibh 
ultricies vehicula ut id elit.</p> -->
+          <!--   </blockquote> -->
+          <!--   <p>Etiam porta <em>sem malesuada magna</em> mollis euismod. 
Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla 
sed consectetur.</p> -->
+          <!--   <h2>Heading</h2> -->
+          <!--   <p>Vivamus sagittis lacus vel augue laoreet rutrum faucibus 
dolor auctor. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, 
eget lacinia odio sem nec elit. Morbi leo risus, porta ac consectetur ac, 
vestibulum at eros.</p> -->
+          <!--   <h3>Sub-heading</h3> -->
+          <!--   <p>Cum sociis natoque penatibus et magnis dis parturient 
montes, nascetur ridiculus mus.</p> -->
+          <!--   <pre><code>Example code block</code></pre> -->
+          <!--   <p>Aenean lacinia bibendum nulla sed consectetur. Etiam porta 
sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, 
tortor mauris condimentum nibh, ut fermentum massa.</p> -->
+          <!--   <h3>Sub-heading</h3> -->
+          <!--   <p>Cum sociis natoque penatibus et magnis dis parturient 
montes, nascetur ridiculus mus. Aenean lacinia bibendum nulla sed consectetur. 
Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus 
commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet 
risus.</p> -->
+          <!--   <ul> -->
+          <!--     <li>Praesent commodo cursus magna, vel scelerisque nisl 
consectetur et.</li> -->
+          <!--     <li>Donec id elit non mi porta gravida at eget metus.</li> 
-->
+          <!--     <li>Nulla vitae elit libero, a pharetra augue.</li> -->
+          <!--   </ul> -->
+          <!--   <p>Donec ullamcorper nulla non metus auctor fringilla. Nulla 
vitae elit libero, a pharetra augue.</p> -->
+          <!--   <ol> -->
+          <!--     <li>Vestibulum id ligula porta felis euismod semper.</li> 
-->
+          <!--     <li>Cum sociis natoque penatibus et magnis dis parturient 
montes, nascetur ridiculus mus.</li> -->
+          <!--     <li>Maecenas sed diam eget risus varius blandit sit amet 
non magna.</li> -->
+          <!--   </ol> -->
+          <!--   <p>Cras mattis consectetur purus sit amet fermentum. Sed 
posuere consectetur est at lobortis.</p> -->
+          <!-- </div><\!-- /.blog-post -\-> -->
+
+        </div>
+
+      </div><!-- /.row -->
+
+      
+        
+    </div><!-- /.container -->
+
+    <!-- Bootstrap core JavaScript
+    ================================================== -->
+    <!-- Placed at the end of the document so the pages load faster -->
+    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
+    <script src="../../dist/js/bootstrap.min.js"></script>
+    <!-- <script src="../../assets/js/docs.min.js"></script> -->
+    <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
+    <!-- <script 
src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
+    -->
+
+    <!-- Start of StatCounter Code for Default Guide -->
+    <script type="text/javascript">
+      var sc_project=8264132; 
+      var sc_invisible=1; 
+      var sc_security="4b97fe2d"; 
+    </script>
+    <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script>
+    <noscript>
+      <div class="statcounter">
+        <a title="hit counter joomla" 
+           href="http://statcounter.com/joomla/"
+           target="_blank">
+          <img class="statcounter"
+               src="http://c.statcounter.com/8264132/0/4b97fe2d/1/"
+               alt="hit counter joomla" />
+        </a>
+      </div>
+    </noscript>
+    <!-- End of StatCounter Code for Default Guide -->
+  </body>
+</html>
+


http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/6.0/pipeline.md
----------------------------------------------------------------------
diff --git a/6.0/pipeline.md b/6.0/pipeline.md
deleted file mode 100644
index 4389435..0000000
--- a/6.0/pipeline.md
+++ /dev/null
@@ -1,666 +0,0 @@
----
-layout: default6
-category: links
-title: The Joshua Pipeline
----
-
-*Please note that Joshua 6.0.3 included some big changes to the directory organization of the
- pipeline's files.*
-
-This page describes the Joshua pipeline script, which manages the complexity 
of training and
-evaluating machine translation systems.  The pipeline eases the pain of two 
related tasks in
-statistical machine translation (SMT) research:
-
-- Training SMT systems involves a complicated process of interacting steps 
that are
-  time-consuming and prone to failure.
-
-- Developing and testing new techniques requires varying parameters at 
different points in the
-  pipeline. Earlier results (which are often expensive) need not be recomputed.
-
-To facilitate these tasks, the pipeline script:
-
-- Runs the complete SMT pipeline, from corpus normalization and tokenization, 
through alignment,
-  model building, tuning, test-set decoding, and evaluation.
-
-- Caches the results of intermediate steps (using robust SHA-1 checksums on 
dependencies), so the
-  pipeline can be debugged or shared across similar runs while doing away with 
time spent
-  recomputing expensive steps.
- 
-- Allows you to jump into and out of the pipeline at a set of predefined 
places (e.g., the alignment
-  stage), so long as you provide the missing dependencies.
-
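-The caching idea in the list above can be illustrated with a small Python sketch. This is a
-simplified model, not the pipeline's actual implementation: the function names, the use of an
-in-memory dict as the cache, and the exact inputs to the checksum are all assumptions. A step
-is rerun only if one of its outputs is missing or if the SHA-1 signature over its command and
-dependencies has changed.
-

```python
import hashlib
import os


def signature(cmd, deps):
    """Return a SHA-1 hex digest over the command and each dependency's contents.

    Dependencies are assumed to exist on disk.
    """
    sha = hashlib.sha1()
    sha.update(cmd.encode("utf-8"))
    for path in deps:
        with open(path, "rb") as handle:
            sha.update(hashlib.sha1(handle.read()).digest())
    return sha.hexdigest()


def needs_rebuild(step, cmd, deps, outputs, cache):
    """A step is stale if an output is missing or its signature has changed.

    cache is a plain dict mapping step names to stored signatures.
    """
    if not all(os.path.exists(path) for path in outputs):
        return True
    return cache.get(step) != signature(cmd, deps)


def mark_done(step, cmd, deps, cache):
    """Record the signature of a step that has just completed."""
    cache[step] = signature(cmd, deps)
```

-
-Hashing file contents rather than comparing timestamps means that copying or touching a file
-does not force an expensive step to be recomputed.
-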
-The Joshua pipeline script is designed in the spirit of Moses' 
`train-model.pl`, and shares
-(and has borrowed) many of its features.  It is not as extensive as Moses'
-[Experiment Management 
System](http://www.statmt.org/moses/?n=FactoredTraining.EMS), which allows
-the user to define arbitrary execution dependency graphs. However, it is 
significantly simpler to
-use, allowing many systems to be built with a single command (that may run for 
days or weeks).
-
-## Dependencies
-
-The pipeline has no *required* external dependencies.  However, it has support 
for a number of
-external packages, some of which are included with Joshua.
-
--  [GIZA++](http://code.google.com/p/giza-pp/) (included)
-
-   GIZA++ is the default aligner.  It is included with Joshua, and should compile successfully when
-   you type `ant` from the Joshua root directory.  It is not required because you can use the
-   (included) Berkeley aligner (`--aligner berkeley`). We have recently also provided support
-   for the [Jacana-XY aligner](http://code.google.com/p/jacana-xy/wiki/JacanaXY) (`--aligner
-   jacana`). 
-
--  [Hadoop](http://hadoop.apache.org/) (included)
-
-   The pipeline uses the [Thrax grammar extractor](thrax.html), which is built 
on Hadoop.  If you
-   have a Hadoop installation, simply ensure that the `$HADOOP` environment 
variable is defined, and
-   the pipeline will use it automatically at the grammar extraction step.  If 
you are going to
-   attempt to extract very large grammars, it is best to have a good-sized 
Hadoop installation.
-   
-   (If you do not have a Hadoop installation, you might consider setting one 
up.  Hadoop can be
-   installed in a
-   
["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed)
-   mode that allows it to use just a few machines or a number of processors on 
a single machine.
-   The main issue is to ensure that there are a lot of independent physical 
disks, since in our
-   experience Hadoop starts to exhibit lots of hard-to-trace problems if there 
is too much demand on
-   the disks.)
-   
-   If you don't have a Hadoop installation, there are still no worries.  The 
pipeline will unroll a
-   standalone installation and use it to extract your grammar.  This behavior 
will be triggered if
-   `$HADOOP` is undefined.
-   
--  [Moses](http://statmt.org/moses/) (not included). Moses is needed
-   if you wish to use its 'kbmira' tuner (--tuner kbmira), or if you
-   wish to build phrase-based models.
-   
--  [SRILM](http://www.speech.sri.com/projects/srilm/) (not included; not 
needed; not recommended)
-
-   By default, the pipeline uses the included 
[KenLM](https://kheafield.com/code/kenlm/) for
-   building (and also querying) language models. Joshua also includes a Java 
program from the
-   [Berkeley LM](http://code.google.com/p/berkeleylm/) package that contains 
code for constructing a
-   Kneser-Ney-smoothed language model in ARPA format from the target side of 
your training data.  
-   There is no need to use SRILM, but if you do wish to use it, you need to do 
the following:
-   
-   1. Install SRILM and set the `$SRILM` environment variable to point to its 
installed location.
-   1. Add the `--lm-gen srilm` flag to your pipeline invocation.
-   
-   More information on this is available in the [LM building section of the 
pipeline](#lm).  SRILM
-   is not used for representing language models during decoding (and in fact 
is not supported,
-   having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) (the 
default) and
-   BerkeleyLM).
-
-After installing any dependencies, follow the brief instructions on
-the [installation page](install.html), and then you are ready to build
-models. 
-
-## A basic pipeline run
-
-The pipeline takes a set of inputs (training, tuning, and test data), and 
creates a set of
-intermediate files in the *run directory*.  By default, the run directory is 
the current directory,
-but it can be changed with the `--rundir` parameter.
-
-For this quick start, we will be working with the example that can be found in
-`$JOSHUA/examples/training`.  This example contains 1,000 sentences of Urdu-English data (the full
-dataset is available as part of the
-[Indian languages parallel corpora](/indian-parallel-corpora/)) with
-100-sentence tuning and test sets with four references each.
-
-Running the pipeline requires two main steps: data preparation and invocation.
-
-1. Prepare your data.  The pipeline script needs to be told where to find the 
raw training, tuning,
-   and test data.  A good convention is to place these files in an input/ 
subdirectory of your run's
-   working directory (NOTE: do not use `data/`, since a directory of that name 
is created and used
-   by the pipeline itself for storing processed files).  The expected format 
(for each of training,
-   tuning, and test) is a pair of files that share a common path prefix and 
are distinguished by
-   their extension, e.g.,
-
-       input/
-             train.SOURCE
-             train.TARGET
-             tune.SOURCE
-             tune.TARGET
-             test.SOURCE
-             test.TARGET
-
-   These files should be parallel at the sentence level (with one sentence per line), should be in
-   UTF-8, and should be untokenized (tokenization occurs in the pipeline).  SOURCE and TARGET denote
-   variables that should be replaced with the actual source and target language abbreviations (e.g.,
-   "ur" and "en").
-   
-1. Run the pipeline.  The following is the minimal invocation to run the 
complete pipeline:
-
-       $JOSHUA/bin/pipeline.pl  \
-         --rundir .             \
-         --type hiero           \
-         --corpus input/train   \
-         --tune input/tune      \
-         --test input/devtest   \
-         --source SOURCE        \
-         --target TARGET
-
-   The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatenated with the
-   language extensions given by `--target` and `--source` (with a "." in between).  Note the
-   correspondences with the files defined in the first step above.  The prefixes can be either
-   absolute or relative pathnames.  This particular invocation assumes that a subdirectory `input/`
-   exists in the current directory, that you are translating from a language identified by the "ur"
-   extension to a language identified by the "en" extension, that the training data can be found at
-   `input/train.en` and `input/train.ur`, and so on.
-
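-The way a prefix and the two language extensions combine into file names can be spelled out in
-a couple of lines of Python. This is purely illustrative; the function name is hypothetical and
-the pipeline itself is not written this way.
-

```python
def corpus_files(prefix, source, target):
    """Join a file prefix with each language extension, using '.' as the separator."""
    return f"{prefix}.{source}", f"{prefix}.{target}"


if __name__ == "__main__":
    # Mirrors --corpus input/train --source ur --target en
    print(corpus_files("input/train", "ur", "en"))
```

-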
-*Don't* run the pipeline directly from `$JOSHUA`, or, for that matter, in any 
directory with lots of other files.
-This can cause problems because the pipeline creates lots of files under 
`--rundir` that can clobber existing files.
-You should run experiments in a clean directory.
-For example, if you have Joshua installed in `$HOME/code/joshua`, manage your 
runs in a different location, such as `$HOME/expts/joshua`.
-
-Assuming no problems arise, this command will run the complete pipeline in 
about 20 minutes,
-producing BLEU scores at the end.  As it runs, you will see output that looks 
like the following:
-   
-    [train-copy-en] rebuilding...
-      dep=/Users/post/code/joshua/test/pipeline/input/train.en 
-      dep=data/train/train.en.gz [NOT FOUND]
-      cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n 
> data/train/train.en.gz
-      took 0 seconds (0s)
-    [train-copy-ur] rebuilding...
-      dep=/Users/post/code/joshua/test/pipeline/input/train.ur 
-      dep=data/train/train.ur.gz [NOT FOUND]
-      cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n 
> data/train/train.ur.gz
-      took 0 seconds (0s)
-    ...
-   
-And in the current directory, you will see the following files (among
-other files, including intermediate files
-generated by the individual sub-steps).
-   
-    data/
-        train/
-            corpus.ur
-            corpus.en
-            thrax-input-file
-        tune/
-            corpus.ur -> tune.tok.lc.ur
-            corpus.en -> tune.tok.lc.en
-            grammar.filtered.gz
-            grammar.glue
-        test/
-            corpus.ur -> test.tok.lc.ur
-            corpus.en -> test.tok.lc.en
-            grammar.filtered.gz
-            grammar.glue
-    alignments/
-        0/
-            [giza/berkeley aligner output files]
-        1/
-        ...
-        training.align
-    thrax-hiero.conf
-    thrax.log
-    grammar.gz
-    lm.gz
-    tune/
-         decoder_command
-         model/
-               [model files]
-         params.txt
-         joshua.log
-         mert.log
-         joshua.config.final
-         final-bleu
-    test/
-         model/
-               [model files]
-         output
-         final-bleu
-
-These files will be described in more detail in subsequent sections of this 
tutorial.
-
-Another useful flag is the `--rundir DIR` flag, which chdir()s to the 
specified directory before
-running the pipeline.  By default the rundir is the current directory.  
Changing it can be useful
-for organizing related pipeline runs.  In fact, we highly recommend
-that you organize your runs using consecutive integers, also taking a
-minute to pass a short note with the `--readme` flag, which allows you
-to quickly generate reports on [groups of related experiments](#managing).
-Relative paths specified to other flags (e.g., to `--corpus`
-or `--lmfile`) are relative to the directory the pipeline was called *from*, 
not the rundir itself
-(unless they happen to be the same, of course).
-
-The complete pipeline comprises many tens of small steps, which can be grouped 
together into a set
-of traditional pipeline tasks:
-   
-1. [Data preparation](#prep)
-1. [Alignment](#alignment)
-1. [Parsing](#parsing) (syntax-based grammars only)
-1. [Grammar extraction](#tm)
-1. [Language model building](#lm)
-1. [Tuning](#tuning)
-1. [Testing](#testing)
-1. [Analysis](#analysis)
-
-These steps are discussed below, after a few intervening sections about 
high-level details of the
-pipeline.
-
-## <a id="managing" /> Managing groups of experiments
-
-The real utility of the pipeline comes when you use it to manage groups of 
experiments. Typically,
-there is a held-out test set, and we want to vary a number of training 
parameters to determine what
-effect this has on BLEU scores or some other metric. Joshua comes with a script
-`$JOSHUA/scripts/training/summarize.pl` that collects information from a group 
of runs and reports
-them to you. This script works so long as you organize your runs as follows:
-
-1. Your runs should be grouped together in a root directory, which I'll call 
`$EXPDIR`.
-
-2. For comparison purposes, the runs should all be evaluated on the same test 
set.
-
-3. Each run in the run group should be in its own numbered directory, shown 
with the files used by
-the summarize script:
-
-       $EXPDIR/
-           1/
-               README.txt
-               test/
-                   final-bleu
-                   final-times
-               [other files]
-           2/
-               README.txt
-               test/
-                   final-bleu
-                   final-times
-               [other files]
-           ...
-               
-You can get such directories using the `--rundir N` flag to the pipeline. 
-
-Run directories can build off each other. For example, `1/` might contain a 
complete baseline
-run. If you want to change only the tuner, you don't need to rerun the 
aligner and model builder,
-so you can reuse the results by supplying the second run with the information 
it needs that was
-computed in run 1:
-
-    $JOSHUA/bin/pipeline.pl \
-      --first-step tune \
-      --grammar 1/grammar.gz \
-      ...
-      
-More details are below.
-
-## Grammar options
-
-Hierarchical Joshua can extract three types of grammars: Hiero
-grammars, GHKM, and SAMT grammars.  As described on the
-[file formats page](file-formats.html), all of them are encoded into
-the same file format, but they differ in terms of the richness of
-their nonterminal sets.
-
-Hiero grammars make use of a single nonterminal, and are extracted by 
computing phrases from
-word-based alignments and then subtracting out phrase differences.  More 
detail can be found in
-[Chiang (2007) 
[PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201).
-[GHKM](http://www.isi.edu/%7Emarcu/papers/cr_ghkm_naacl04.pdf) (new with 5.0) 
and
-[SAMT](http://www.cs.cmu.edu/~zollmann/samt/) grammars make use of a source- 
or target-side parse
-tree on the training data, differing in the way they extract rules using these 
trees: GHKM extracts
-synchronous tree substitution grammar rules rooted in a subset of the tree 
constituents, whereas
-SAMT projects constituent labels down onto phrases.  SAMT grammars are usually 
many times larger and
-are much slower to decode with, but sometimes increase BLEU score.  Both 
grammar formats are
-extracted with the [Thrax software](thrax.html).
-
-By default, the Joshua pipeline extracts a Hiero grammar, but this can be 
altered with the `--type
-(ghkm|samt)` flag. For GHKM grammars, the default is to use
-[Michel Galley's 
extractor](http://www-nlp.stanford.edu/~mgalley/software/stanford-ghkm-latest.tar.gz),
-but you can also use Moses' extractor with `--ghkm-extractor moses`. Galley's 
extractor only outputs
-two features, so the scores tend to be significantly lower than those of Moses'.
-
-Joshua (new in version 6) also includes an unlexicalized phrase-based
-decoder. Building a phrase-based model requires you to have Moses
-installed, since its `train-model.perl` script is used to extract the
-phrase table. You can enable this by defining the `$MOSES` environment
-variable and then specifying `--type phrase`.
-
-## Other high-level options
-
-The following command-line arguments control run-time behavior of multiple 
steps:
-
-- `--threads N` (1)
-
-  This enables multithreaded operation for a number of steps: alignment (with 
GIZA, max two
-  threads), parsing, and decoding (any number of threads).
-  
-- `--jobs N` (1)
-
-  This enables parallel operation over a cluster using the qsub command.  This 
feature is not
-  well-documented at this point, but you will likely want to edit the file
-  `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to set up your qsub 
environment, and may also
-  want to pass specific qsub arguments via the `--qsub-args "ARGS"`
-  flag. We suggest you stick to the standard Joshua model that
-  tries to use as many cores as are available with the `--threads N` option.
-
-## Restarting failed runs
-
-If the pipeline dies, you can restart it with the same command you used the 
first time.  If you
-rerun the pipeline with the exact same invocation as the previous run (or an 
overlapping
-configuration -- one that causes the same set of behaviors), you will see 
slightly different
-output compared to what we saw above:
-
-    [train-copy-en] cached, skipping...
-    [train-copy-ur] cached, skipping...
-    ...
-
-This indicates that the caching module has discovered that the step was 
already computed and thus
-did not need to be rerun.  This feature is quite useful for restarting 
pipeline runs that have
-crashed due to bugs, memory limitations, hardware failures, and the myriad 
other problems that
-plague MT researchers across the world.
-
-Often, a command will die because it was parameterized incorrectly.  For 
example, perhaps the
-decoder ran out of memory.  Caching allows you to adjust the parameter (e.g., 
`--joshua-mem`) and rerun
-the script.  Of course, if you change one of the parameters a step depends on, 
it will trigger a
-rerun, which in turn might trigger further downstream reruns.
-   
-## <a id="steps" /> Skipping steps, quitting early
-
-You will also find it useful to start the pipeline somewhere other than data 
preparation (for
-example, if you have already-processed data and an alignment, and want to 
begin with building a
-grammar) or to end it prematurely (if, say, you don't have a test set and just 
want to tune a
-model).  This can be accomplished with the `--first-step` and `--last-step` 
flags, which take as
-argument one of the following step names (case-insensitive):
-
-- *FIRST*: Data preparation.  Everything begins with data preparation.  This 
is the default first
-   step, so there is no need to be explicit about it.
-
-- *ALIGN*: Alignment.  You might want to start here if you want to skip data 
preprocessing.
-
-- *PARSE*: Parsing.  This is only relevant for building SAMT and GHKM grammars (`--type 
samt` or `--type ghkm`), in which case
-   the target side (`--target`) of the training data (`--corpus`) is parsed 
before building a
-   grammar.
-
-- *THRAX*: Grammar extraction [with Thrax](thrax.html).  If you jump to this 
step, you'll need to
-   provide an aligned corpus (`--alignment`) along with your parallel data.  
-
-- *TUNE*: Tuning.  The exact tuning method is determined with `--tuner 
{mert,mira,pro}`.  With this
-   option, you need to specify a grammar (`--grammar`) or separate tune 
(`--tune-grammar`) and test
-   (`--test-grammar`) grammars.  A full grammar (`--grammar`) will be filtered 
against the relevant
-   tuning or test set unless you specify `--no-filter-tm`.  If you want a 
language model built from
-   the target side of your training data, you'll also need to pass in the 
training corpus
-   (`--corpus`).  You can also specify an arbitrary number of additional 
language models with one or
-   more `--lmfile` flags.
-
-- *TEST*: Testing.  If you have a tuned model file, you can test new corpora 
by passing in a test
-   corpus with references (`--test`).  You'll need to provide a run name 
(`--name`) to store the
-   results of this run, which will be placed under `test/NAME`.  You'll also 
need to provide a
-   Joshua configuration file (`--joshua-config`), one or more language models 
(`--lmfile`), and a
-   grammar (`--grammar`); this will be filtered to the test data unless you 
specify
-   `--no-filter-tm` or unless you directly provide a filtered test grammar 
(`--test-grammar`).
-
-- *LAST*: The last step.  This is the default target of `--last-step`.
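
As a rough illustration of how `--first-step` and `--last-step` select a contiguous range of steps, here is a minimal Python sketch. The step names come from the list above; the helper `steps_to_run` is hypothetical and is not part of the pipeline script, which may handle these flags differently.

```python
# Step names as listed above. The helper below is a hypothetical
# illustration, not code taken from pipeline.pl.
STEPS = ["FIRST", "ALIGN", "PARSE", "THRAX", "TUNE", "TEST", "LAST"]


def steps_to_run(first_step="FIRST", last_step="LAST"):
    """Return the step names from first_step through last_step, inclusive.

    Names are matched case-insensitively, mirroring the flags' behavior.
    An unknown name raises ValueError.
    """
    first = STEPS.index(first_step.upper())
    last = STEPS.index(last_step.upper())
    if first > last:
        raise ValueError(f"{first_step} comes after {last_step}")
    return STEPS[first:last + 1]


print(steps_to_run("thrax", "tune"))
```

With the defaults, every step is selected, which matches the statement that FIRST and LAST are the default endpoints.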
-
-We now discuss these steps in more detail.
-
-### <a id="prep" /> 1. DATA PREPARATION
-
-Data preparation involves doing the following to each of the training data 
(`--corpus`), tuning data
-(`--tune`), and testing data (`--test`).  Each of these values is an absolute 
or relative path
-prefix.  To each of these prefixes, a "." is appended, followed by each of 
SOURCE (`--source`) and
-TARGET (`--target`), which are file extensions identifying the languages.  The 
SOURCE and TARGET
-files must have the same number of lines.  
-
-For tuning and test data, multiple references are handled automatically.  A 
single reference will
-have the format TUNE.TARGET, while multiple references will have the format 
TUNE.TARGET.NUM, where
-NUM starts at 0 and increments for as many references as there are.
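
The naming convention can be made concrete with a short Python sketch. The function names and the example prefix `input/tune` are hypothetical; only the `PREFIX.LANG` and `PREFIX.TARGET.NUM` patterns come from the text above.

```python
def corpus_files(prefix, source, target):
    """Build the source and target file names for a path prefix."""
    return f"{prefix}.{source}", f"{prefix}.{target}"


def reference_files(prefix, target, num_refs):
    """List the reference file names for tuning or test data.

    A single reference is PREFIX.TARGET; multiple references are
    PREFIX.TARGET.0, PREFIX.TARGET.1, and so on.
    """
    if num_refs == 1:
        return [f"{prefix}.{target}"]
    return [f"{prefix}.{target}.{i}" for i in range(num_refs)]


# Hypothetical Urdu-English example with four references.
print(corpus_files("input/tune", "ur", "en"))
print(reference_files("input/tune", "en", 4))
```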
-
-The following processing steps are applied to each file.
-
-1.  **Copying** the files into `$RUNDIR/data/TYPE`, where TYPE is one of 
"train", "tune", or "test".
-    Multiple `--corpus` files are concatenated in the order they are 
specified.  Multiple `--tune`
-    and `--test` flags are not currently allowed.
-    
-1.  **Normalizing** punctuation and text (e.g., removing extra spaces, 
converting special
-    quotations).  There are a few language-specific options that depend on the 
file extension
-    matching the [two-letter ISO 
639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
-    designation.
-
-1.  **Tokenizing** the data (e.g., separating out punctuation, converting 
brackets).  Again, there
-    are language-specific tokenizations for a few languages (English, German, 
and Greek).
-
-1.  (Training only) **Removing** all parallel sentences with more than 
`--maxlen` tokens on either
-    side.  By default, MAXLEN is 50.  To turn this off, specify `--maxlen 0`.
-
-1.  **Lowercasing**.
-
-This creates a series of intermediate files which are saved for posterity but 
compressed.  For
-example, you might see
-
-    data/
-        train/
-            train.en.gz
-            train.tok.en.gz
-            train.tok.50.en.gz
-            train.tok.50.lc.en
-            corpus.en -> train.tok.50.lc.en
-
-The file "corpus.LANG" is a symbolic link to the last file in the chain.  
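
The length filter applied to the training data can be sketched in Python as follows. This is an illustration only: it assumes tokens are separated by whitespace, and the function name `filter_by_length` is hypothetical rather than something the pipeline provides.

```python
def filter_by_length(pairs, maxlen=50):
    """Drop sentence pairs with more than maxlen tokens on either side.

    pairs is a list of (source, target) strings that are already
    tokenized. A maxlen of 0 disables the filter, as with `--maxlen 0`.
    """
    if maxlen == 0:
        return list(pairs)
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= maxlen and len(tgt.split()) <= maxlen
    ]


pairs = [("a b c", "x y"), ("a b c d e f", "x")]
print(filter_by_length(pairs, maxlen=5))
```

A pair is kept only when both sides are within the limit, so one overlong side is enough to remove it.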
-
-## 2. ALIGNMENT <a id="alignment" />
-
-Alignments are computed between the parallel corpora at 
`$RUNDIR/data/train/corpus.{SOURCE,TARGET}`.  To
-prevent the alignment tables from getting too big, the parallel corpora are 
split into blocks of no
-more than ALIGNER\_CHUNK\_SIZE sentence pairs (controlled with the `--aligner-chunk-size` parameter below).  
The last block is folded
-into the penultimate block if it is too small.  These chunked files are all 
created in a
-subdirectory of `$RUNDIR/data/train/splits`, named `corpus.LANG.0`, 
`corpus.LANG.1`, and so on.
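
The chunking behavior described above might look like the following Python sketch. The text does not say how small the last block must be before it is folded, so the `min_last` threshold (defaulting to half the chunk size) is purely an assumption, as is the function name.

```python
def chunk_corpus(pairs, chunk_size=1_000_000, min_last=None):
    """Split a list of sentence pairs into blocks for separate alignment.

    The final block is merged into the one before it when it has fewer
    than min_last pairs. The default threshold is an assumption; the
    pipeline's actual cutoff is not documented here.
    """
    if min_last is None:
        min_last = chunk_size // 2
    blocks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    if len(blocks) > 1 and len(blocks[-1]) < min_last:
        # Fold the short trailing block into the penultimate block.
        blocks[-2].extend(blocks.pop())
    return blocks


blocks = chunk_corpus(list(range(10)), chunk_size=4, min_last=3)
print([len(b) for b in blocks])
```

Note that folding lets the final block exceed the nominal chunk size by a small amount.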
-
-The pipeline parameters affecting alignment are:
-
--   `--aligner ALIGNER` {giza (default), berkeley, jacana}
-
-    Which aligner to use.  The default is 
[GIZA++](http://code.google.com/p/giza-pp/), but
-    [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be 
used instead.  When
-    using the Berkeley aligner, you'll want to pay attention to how much 
memory you allocate to it
-    with `--aligner-mem` (the default is 10g).
-
--   `--aligner-chunk-size SIZE` (1,000,000)
-
-    The number of sentence pairs to compute alignments over. The training data 
is split into blocks
-    of this size, aligned separately, and then concatenated.
-    
--   `--alignment FILE`
-
-    If you have an already-computed alignment, you can pass that to the script 
using this flag.
-    Note that, in this case, you will want to skip data preparation and 
alignment using
-    `--first-step thrax` (the first step after alignment) and also to specify 
`--no-prepare` so
-    as not to retokenize the data and mess with your alignments.
-    
-    The alignment file format is the standard format where 0-indexed many-many 
alignment pairs for a
-    sentence are provided on a line, source language first, e.g.,
-
-      0-0 0-1 1-2 1-7 ...
-
-    This value is required if you start at the grammar extraction step.
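
To make the alignment file format concrete, here is a minimal Python sketch that reads one line of it. The function name is hypothetical, and the sketch assumes every token is a well-formed `SRC-TGT` pair of non-negative integers.

```python
def parse_alignment_line(line):
    """Parse one line of 0-indexed alignment pairs, source index first.

    "0-0 0-1 1-2" becomes [(0, 0), (0, 1), (1, 2)].
    """
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


print(parse_alignment_line("0-0 0-1 1-2 1-7"))
```

Each line of the alignment file corresponds to the sentence pair on the same line of the training corpora.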
-
-When alignment is complete, the alignment file can be found at 
`$RUNDIR/alignments/training.align`.
-It is parallel to the training corpora.  There are many files in the 
`alignments/` subdirectory that
-contain the output of intermediate steps.
-
-### <a id="parsing" /> 3. PARSING
-
-To build SAMT and GHKM grammars (`--type samt` and `--type ghkm`), the target 
side of the
-training data must be parsed. The pipeline assumes your target side will be 
English, and will parse
-it for you using [the Berkeley 
parser](http://code.google.com/p/berkeleyparser/), which is included.
-If it is not the case that English is your target-side language, the target 
side of your training
-data (found at CORPUS.TARGET) must already be parsed in PTB format.  The 
pipeline will notice that
-it is parsed and will not reparse it.
-
-Parsing is affected by both the `--threads N` and `--jobs N` options.  The 
former runs the parser in
-multithreaded mode, while the latter distributes the runs across a cluster 
(and requires some
-configuration, not yet documented).  The options are mutually exclusive.
-
-Once the parsing is complete, there will be two parsed files:
-
-- `$RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was 
parsed.
-- `$RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of 
the above file used for
-  grammar extraction.
-
-## 4. THRAX (grammar extraction) <a id="tm" />
-
-The grammar extraction step takes three pieces of data: (1) the 
source-language training corpus, (2)
-the target-language training corpus (parsed, if an SAMT grammar is being 
extracted), and (3) the
-alignment file.  From these, it computes a synchronous context-free grammar.  
If you already have a
-grammar and wish to skip this step, you can do so by passing the grammar with the 
`--grammar
-/path/to/grammar` flag.
-
-The main variable in grammar extraction is Hadoop.  If you have a Hadoop 
installation, simply ensure
-that the environment variable `$HADOOP` is defined, and Thrax will seamlessly 
use it.  If you *do
-not* have a Hadoop installation, the pipeline will roll out out for you, 
running Hadoop in
-standalone mode (this mode is triggered when `$HADOOP` is undefined).  
Theoretically, any grammar
-extractable on a full Hadoop cluster should be extractable in standalone mode, 
if you are patient
-enough; in practice, you probably are not patient enough, and will be limited 
to smaller
-datasets. You may also run into problems with disk space; Hadoop uses a lot 
(use `--tmp
-/path/to/tmp` to specify an alternate place for temporary data; we suggest you 
use a local disk
-partition with tens or hundreds of gigabytes free, and not an NFS partition).  
Setting up your own
-Hadoop cluster is not too difficult a chore; in particular, you may find it 
helpful to install a
-[pseudo-distributed version of 
Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
-In our experience, this works fine, but you should note the following caveats:
-
-- It is of crucial importance that you have enough physical disks.  We have 
found that having too
-  few disks, or disks that are too slow, results in a whole host of seemingly unrelated 
issues that are hard to
-  resolve, such as timeouts.  
-- NFS filesystems can cause lots of problems.  You should really try to 
install physical disks that
-  are dedicated to Hadoop scratch space.
-
-Here are some flags relevant to Hadoop and grammar extraction with Thrax:
-
-- `--hadoop /path/to/hadoop`
-
-  This sets the location of Hadoop (overriding the environment variable 
`$HADOOP`).
-  
-- `--hadoop-mem MEM` (2g)
-
-  This alters the amount of memory available to Hadoop mappers (passed via the
-  `mapred.child.java.opts` option).
-  
-- `--thrax-conf FILE`
-
-   Use the provided Thrax configuration file instead of the (grammar-specific) 
default.  The Thrax
-   templates are located at 
`$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is one
-   of "hiero" or "samt".
-  
-When the grammar is extracted, it is compressed and placed at 
`$RUNDIR/grammar.gz`.
-
-## <a id="lm" /> 5. Language model
-
-Before tuning can take place, a language model is needed.  A language model is 
always built from the
-target side of the training corpus unless `--no-corpus-lm` is specified.  In 
addition, you can
-provide other language models (any number of them) with the `--lmfile FILE` 
argument.  Other
-arguments are as follows.
-
--  `--lm` {kenlm (default), berkeleylm}
-
-   This determines the language model code that will be used when decoding.  
These implementations
-   are described in their respective papers (PDFs:
-   [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
-   
[BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)). 
KenLM is written in
-   C++ and requires a pass through the JNI, but is recommended because it 
supports left-state minimization.
-   
-- `--lmfile FILE`
-
-  Specifies a pre-built language model to use when decoding.  This language 
model can be in ARPA
-  format, or in KenLM format when using KenLM, or in BerkeleyLM format when using 
BerkeleyLM.
-
-- `--lm-gen` {kenlm (default), srilm, berkeleylm}, `--buildlm-mem MEM`, 
`--witten-bell`
-
-  At the tuning step, an LM is built from the target side of the training data 
(unless
-  `--no-corpus-lm` is specified).  This controls which code is used to build 
it.  The default is
-  KenLM's [lmplz](http://kheafield.com/code/kenlm/estimation/), which is 
strongly recommended.
-  
-  If SRILM is used, it is called with the following arguments:
-  
-        $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text 
TRAINING-DATA -unk -lm lm.gz
-        
-  Where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is 
passed to the pipeline.
-  
-  A [BerkeleyLM Java 
class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
-  is also available. It computes a Kneser-Ney LM with a constant discounting 
(0.75) and no count
-  thresholding.  The flag `--buildlm-mem` can be used to control how much 
memory is allocated to the
-  Java process.  The default is "2g", but you will want to increase it for 
larger language models.
-  
-  A language model built from the target side of the training data is placed 
at `$RUNDIR/lm.gz`.  
-
-## Interlude: decoder arguments
-
-Running the decoder is done in both the tuning stage and the testing stage.  A 
critical point is
-that you have to give the decoder enough memory to run.  Joshua can be very 
memory-intensive, in
-particular when decoding with large grammars and large language models.  The 
default amount of
-memory is 3100m, which is likely not enough (especially if you are decoding 
with an SAMT grammar).  You
-can alter the amount of memory for Joshua using the `--joshua-mem MEM` 
argument, where MEM is a Java
-memory specification (passed to its `-Xmx` flag).
-
-## <a id="tuning" /> 6. TUNING
-
-Two optimizers are provided with Joshua: MERT and PRO (`--tuner {mert,pro}`).  
If Moses is
-installed, you can also use Cherry & Foster's k-best batch MIRA (`--tuner 
mira`, recommended).
-Tuning is run till convergence in the `$RUNDIR/tune` directory.
-
-When tuning is finished, the final configuration file can be found at
-
-    $RUNDIR/tune/joshua.config.final
-
-## <a id="testing" /> 7. Testing 
-
-For each of the tuner runs, Joshua takes the tuner output file and decodes the 
test set.  If you
-like, you can also apply minimum Bayes-risk decoding to the decoder output 
with `--mbr`.  This
-usually yields a gain of about 0.3 to 0.5 BLEU points, but is time-consuming.
-
-After decoding the test set with each set of tuned weights, Joshua computes 
the mean BLEU score,
-writes it to `$RUNDIR/test/final-bleu`, and cats it. It also writes a file
-`$RUNDIR/test/final-times` containing a summary of runtime information. That's 
the end of the pipeline!
-
-Joshua also supports decoding further test sets.  This is enabled by rerunning 
the pipeline with a
-number of arguments:
-
--   `--first-step TEST`
-
-    This tells the pipeline to start at the test step.
-
--   `--joshua-config CONFIG`
-
-    A tuned parameter file is required.  This file will be the output of some 
prior tuning run.
-    Necessary pathnames and so on will be adjusted.
-    
-## <a id="analysis" /> 8. ANALYSIS
-
-If you have used the suggested layout, with a number of related runs all 
contained in a common
-directory with sequential numbers, you can use the script 
`$JOSHUA/scripts/training/summarize.pl` to
-display a summary of the mean BLEU scores from all runs, along with the text 
you placed in the run
-README file (using the pipeline's `--readme TEXT` flag).
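
For readers who want to collect the same information themselves, the following Python sketch gathers the BLEU file and README note from each numbered run directory. It is not the implementation of `summarize.pl`. It assumes only the directory layout shown earlier, treats the contents of `final-bleu` as opaque text, and uses a hypothetical function name.

```python
from pathlib import Path


def summarize_runs(expdir):
    """Collect (run number, BLEU text, README text) for each numbered run.

    Only subdirectories whose names are integers are considered. Missing
    files are reported as "n/a" (BLEU) or an empty note rather than
    raising an error.
    """
    run_dirs = [d for d in Path(expdir).iterdir() if d.is_dir() and d.name.isdigit()]
    rows = []
    for run in sorted(run_dirs, key=lambda d: int(d.name)):
        bleu_file = run / "test" / "final-bleu"
        readme = run / "README.txt"
        bleu = bleu_file.read_text().strip() if bleu_file.exists() else "n/a"
        note = readme.read_text().strip() if readme.exists() else ""
        rows.append((int(run.name), bleu, note))
    return rows
```

Calling `summarize_runs` on the root directory of a run group returns one row per run, sorted numerically so that run 10 follows run 9 rather than run 1.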
-
-## COMMON USE CASES AND PITFALLS 
-
-- If the pipeline dies at the "thrax-run" stage with an error like the 
following:
-
-      JOB FAILED (return code 1) 
-      hadoop/bin/hadoop: line 47: 
-      /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or 
directory 
-      Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/hadoop/fs/FsShell 
-      Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.fs.FsShell 
-      
-  This occurs if the `$HADOOP` environment variable is set but does not point 
to a working
-  Hadoop installation.  To fix it, make sure to unset the variable:
-  
-      # in bash
-      unset HADOOP
-      
-  and then rerun the pipeline with the same invocation.
-
-- Memory usage is a major consideration in decoding with Joshua and 
hierarchical grammars.  In
-  particular, SAMT grammars often require a large amount of memory.  Many 
steps have been taken to
-  reduce memory usage, including beam settings and test-set- and 
sentence-level filtering of
-  grammars.  However, memory usage can still be in the tens of gigabytes.
-
-  To accommodate this kind of variation, the pipeline script allows you to 
specify both (a) the
-  amount of memory used by the Joshua decoder instance and (b) the amount of 
memory required of
-  nodes obtained by the qsub command.  These are accomplished with the 
`--joshua-mem` MEM and
-  `--qsub-args` ARGS flags.  For example,
-
-      pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...
-
-  Also, should Thrax fail, it might be due to a memory restriction. By 
default, Thrax requests 2 GB
-  from the Hadoop server. If more memory is needed, set the memory requirement 
with the
-  `--hadoop-mem` flag, in the same way as the `--joshua-mem` option is used.
-
-- Other pitfalls and advice will be added as they are discovered.
-
-## FEEDBACK 
-
-Please email [email protected] with problems or suggestions.
-
