Initial import of joshua-decoder.github.com site to Apache
Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/commit/ccc92816 Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/tree/ccc92816 Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/diff/ccc92816 Branch: refs/heads/asf-site Commit: ccc928165df2cd288b9fd7152f56a9be6cd3fc33 Parents: Author: Lewis John McGibbney <[email protected]> Authored: Mon Apr 4 22:16:48 2016 -0700 Committer: Lewis John McGibbney <[email protected]> Committed: Mon Apr 4 22:16:48 2016 -0700 ---------------------------------------------------------------------- 4.0/decoder.md | 910 +++ 4.0/faq.md | 7 + 4.0/features.md | 7 + 4.0/file-formats.md | 78 + 4.0/index.md | 48 + 4.0/large-lms.md | 192 + 4.0/lattice.md | 17 + 4.0/packing.md | 76 + 4.0/pipeline.md | 576 ++ 4.0/step-by-step-instructions.html | 908 +++ 4.0/thrax.md | 14 + 4.0/tms.md | 106 + 4.0/zmert.md | 83 + 5.0/advanced.md | 7 + 5.0/bundle.md | 24 + 5.0/decoder.md | 374 ++ 5.0/faq.md | 7 + 5.0/features.md | 6 + 5.0/file-formats.md | 72 + 5.0/index.md | 77 + 5.0/jacana.md | 139 + 5.0/large-lms.md | 192 + 5.0/packing.md | 76 + 5.0/pipeline.md | 640 ++ 5.0/server.md | 30 + 5.0/thrax.md | 14 + 5.0/tms.md | 106 + 5.0/tutorial.md | 174 + 5.0/zmert.md | 83 + 6.0/advanced.md | 7 + 6.0/bundle.md | 100 + 6.0/decoder.md | 385 ++ 6.0/faq.md | 161 + 6.0/features.md | 6 + 6.0/file-formats.md | 72 + 6.0/index.md | 24 + 6.0/install.md | 88 + 6.0/jacana.md | 139 + 6.0/large-lms.md | 192 + 6.0/packing.md | 74 + 6.0/pipeline.md | 666 ++ 6.0/quick-start.md | 59 + 6.0/server.md | 30 + 6.0/thrax.md | 14 + 6.0/tms.md | 106 + 6.0/tutorial.md | 187 + 6.0/whats-new.md | 12 + 6.0/zmert.md | 83 + 6/advanced.md | 7 + 6/bundle.md | 100 + 6/decoder.md | 385 ++ 6/faq.md | 161 + 6/features.md | 6 + 6/file-formats.md | 72 + 6/index.md | 24 + 6/install.md | 88 + 6/jacana.md | 139 + 6/large-lms.md | 192 + 6/packing.md | 74 + 6/pipeline.md | 666 ++ 6/quick-start.md | 59 + 6/server.md | 30 + 6/thrax.md | 14 + 6/tms.md | 106 + 6/tutorial.md | 187 + 6/whats-new.md | 12 + 6/zmert.md | 83 + CNAME | 1 + README.md | 42 + _config.yml | 5 + _data/joshua.yaml | 2 + _layouts/default.html | 169 + _layouts/default4.html | 94 + _layouts/default6.html | 200 + _layouts/documentation.html | 60 + blog.css | 171 + bootstrap/css/bootstrap-responsive.css | 1109 +++ bootstrap/css/bootstrap-responsive.min.css | 9 + bootstrap/css/bootstrap.css | 6167 +++++++++++++++++ bootstrap/css/bootstrap.min.css | 9 + bootstrap/img/glyphicons-halflings-white.png | Bin 0 -> 8777 bytes bootstrap/img/glyphicons-halflings.png | Bin 0 -> 12799 bytes bootstrap/js/bootstrap.js | 2280 +++++++ bootstrap/js/bootstrap.min.js | 6 + contributors.md | 44 + data/fisher-callhome-corpus/images/lattice.png | Bin 0 -> 22684 bytes data/fisher-callhome-corpus/index.html | 94 + data/index.html | 7 + data/indian-parallel-corpora/images/map1.png | Bin 0 -> 59635 bytes data/indian-parallel-corpora/images/map2.png | Bin 0 -> 51311 bytes data/indian-parallel-corpora/index.html | 111 + devel/index.html | 16 + dist/css/bootstrap-theme.css | 470 ++ dist/css/bootstrap-theme.css.map | 1 + dist/css/bootstrap-theme.min.css | 5 + dist/css/bootstrap.css | 6332 ++++++++++++++++++ dist/css/bootstrap.css.map | 1 + dist/css/bootstrap.min.css | 5 + dist/fonts/glyphicons-halflings-regular.eot | Bin 0 -> 20335 bytes dist/fonts/glyphicons-halflings-regular.svg | 229 + dist/fonts/glyphicons-halflings-regular.ttf | Bin 0 -> 41280 bytes 
dist/fonts/glyphicons-halflings-regular.woff | Bin 0 -> 23320 bytes dist/js/bootstrap.js | 2320 +++++++ dist/js/bootstrap.min.js | 7 + dist/js/npm.js | 13 + fisher-callhome-corpus/index.html | 1 + images/desert.jpg | Bin 0 -> 121958 bytes images/joshua-logo-small.png | Bin 0 -> 29235 bytes images/joshua-logo.jpg | Bin 0 -> 236977 bytes images/joshua-logo.pdf | Bin 0 -> 1465851 bytes images/joshua-logo.png | Bin 0 -> 858713 bytes images/logo-credits.txt | 1 + images/sponsors/NSF-logo.jpg | Bin 0 -> 38008 bytes images/sponsors/darpa-logo.jpg | Bin 0 -> 11552 bytes images/sponsors/euromatrix.png | Bin 0 -> 59093 bytes images/sponsors/hltcoe-logo1.jpg | Bin 0 -> 8278 bytes images/sponsors/hltcoe-logo1.png | Bin 0 -> 22031 bytes images/sponsors/hltcoe-logo2.jpg | Bin 0 -> 8803 bytes images/sponsors/hltcoe-logo2.png | Bin 0 -> 9767 bytes images/sponsors/hltcoe-logo3.png | Bin 0 -> 34899 bytes index.md | 43 + index5.html | 237 + indian-parallel-corpora/index.html | 1 + joshua.bib | 12 + joshua.css | 44 + joshua4.css | 184 + joshua6.css | 220 + language-packs.csv | 2 + language-packs/ar-en-phrase/index.html | 16 + language-packs/es-en-phrase/index.html | 16 + language-packs/index.md | 69 + language-packs/paraphrase/index.md | 8 + language-packs/zh-en-hiero/index.html | 16 + publications/joshua-2.0.pdf | Bin 0 -> 95757 bytes publications/joshua-3.0.pdf | Bin 0 -> 198854 bytes ...lkit-for-statistical-machine-translation.pdf | Bin 0 -> 105762 bytes releases.md | 61 + releases/5.0/index.html | 16 + releases/6.0/index.html | 4 + releases/current/index.html | 4 + releases/index.md | 11 + releases/runtime/index.html | 4 + style.css | 237 + support/index.md | 25 + 144 files changed, 31064 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/decoder.md ---------------------------------------------------------------------- diff --git a/4.0/decoder.md b/4.0/decoder.md new file mode 100644 index 0000000..e3839bf --- /dev/null +++ b/4.0/decoder.md @@ -0,0 +1,910 @@ +--- +layout: default4 +category: links +title: Decoder configuration parameters +--- + +Joshua configuration parameters affect the runtime behavior of the decoder itself. This page +describes the complete list of these parameters and describes how to invoke the decoder manually. + +To run the decoder, a convenience script is provided that loads the necessary Java libraries. +Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation, +its syntax is: + + $JOSHUA/joshua-decoder [-m memory-amount] [-c config-file other-joshua-options ...] + +The `-m` argument, if present, must come first, and the memory specification is in Java format +(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and +"gigabytes", and there cannot be a space between the number and the unit. The value of this +argument is passed to Java itself in the invocation of the decoder, and the remaining options are +passed to Joshua. The `-c` parameter has special import because it specifies the location of the +configuration file. + +The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are +received, according to a number of [output options](#output). If no run-time parameters are +specified (e.g., no translation model), sentences are simply pushed through untranslated. 
Blank +lines are similarly pushed through as blank lines, so as to maintain parallelism with the input. + +Parameters can be provided to Joshua via a configuration file and from the command +line. Command-line arguments override values found in the configuration file. The format for +configuration file parameters is + + parameter = value + +Command-line options are specified in the following format + + -parameter value + +Values are one of four types (which we list here mostly to call attention to the boolean format): + +- STRING, an arbitrary string (no spaces) +- FLOAT, a floating-point value +- INT, an integer +- BOOLEAN, a boolean value. For booleans, `true` evaluates to true, and all other values evaluate + to false. For command-line options, the value may be omitted, in which case it evaluates to + true. For example, the following are equivalent: + + $JOSHUA/joshua-decoder -show-align-index true + $JOSHUA/joshua-decoder -show-align-index + +## Joshua configuration file + +Before describing the list of Joshua parameters, we present a note about the configuration file. +In addition to the decoder parameters described below, the configuration file contains the feature +weight values for the model. The weight values are distinguished from runtime parameters in two +ways: (1) they cannot be overridden on the command line, and (2) they do not have an equals sign +(=). Parameters are described in further detail in the [feature file](features.html). They take +the following format, and by convention are placed at the end of the configuration file: + + lm 0 4.23 + phrasement pt 0 -0.2 + oovpenalty -100 + +## Joshua decoder parameters + +This section contains a list of the Joshua run-time parameters. An important note about the +parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (-) are +removed and case is converted to lowercase. For example, the following parameter forms are +equivalent (either in the configuration file or from the command line): + + {top-n, topN, top_n, TOP_N, t-o-p-N} + {poplimit, pop-limit, pop-limit, popLimit} + +This basically defines equivalence classes of parameters, and relieves you of the task of having to +remember the exact format of each parameter. + +In what follows, we group the configuration parameters in the following groups: + +- [Alternate modes of operation](#modes) +- [General options](#general) +- [Pruning](#pruning) +- [Translation model options](#tm) +- [Language model options](#lm) +- [Output options](#output) + +<a name="modes" /> + +### Alternate modes of operation + +In addition to decoding (which is the default mode), Joshua can also produce synchronous parses of a +(source,target) pair of sentences. This mode disables the language model (since no generation is +required) but still requires a translation model. To enable it, you must do two things: + +1. Set the configuration parameters `parse = true`. +2. Provide input in the following format: + + source sentence ||| target sentence + +You may also wish to display the synchronouse parse tree (`-use-tree-nbest`) and the alignment +(`-show-align-index`). + +The synchronous parsing implementation is that of Dyer (2010) +[PDF](http://www.aclweb.org/anthology/N/N10/N10-1033). + +If parsing is enabled, the following features become relevant. If you would like more information +about how to use these features, please ask [Jonny Weese](http://cs.jhu.edu/~jonny/) to document +them. 
+ +- `forest-pruning` --- *false* + + If true, the synchronous forest will be pruned. + +- `forest-pruning-threshold` --- *10* + + The threshold used for pruning. + +- `use-kbest-hg` --- *false* + + The k-best hypergraph to use. + + +<a name="general" /> + +### General decoder options + +- `c`, `config` --- *NULL* + + Specifies the configuration file from which Joshua options are loaded. This feature is unique in + that it must be specified from the command line. + +- `oracle-file` --- *NULL* + + The location of a set of oracle reference translations, parallel to the input. When present, + after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the + translation forest with a BLEU approximation in order to extract the oracle-translation from the + forest. This is useful for obtaining an (approximation to an) upper bound on your translation + model under particular search settings. + +- `default-nonterminal` --- *"X"* + + This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. + +- `goal-symbol` --- *"GOAL"* + + This is the symbol whose presence in the chart over the whole input span denotes a successful + parse (translation). It should match the LHS nonterminal in your glue grammar. Internally, + Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can + optionally supply in the configuration file. + +- `true-oovs-only` --- *false* + + By default, Joshua creates an OOV entry for every word in the source sentence, regardless of + whether it is found in the grammar. This allows every word to be pushed through untranslated + (although potentially incurring a high cost based on the `oovPenalty` feature). If this option is + set, then only true OOVs are entered into the chart as OOVs. + +- `use-sent-specific-tm` --- *false* + + If set to true, Joshua will look for sentence-specific filtered grammars. The location is + determined by taking the supplied translation model (`tm-file`) and looking for a `filtered/` + subdirectory for a file with the same name but with the (0-indexed) sentence number appended to + it. For example, if + + tm-file = /path/to/grammar.gz + + then the sentence-filtered grammars should be found at + + /path/to/filtered/grammar.0.gz + /path/to/filtered/grammar.1.gz + /path/to/filtered/grammar.2.gz + ... + +- `threads`, `num-parallel-decoders` --- *1* + + This determines how many simultaneous decoding threads to launch. + + Outputs are assembled in order and Joshua has to hold on to the complete target hypergraph until + it is ready to be processed for output, so too many simultaneous threads could result in lots of + memory usage if a long sentence results in many sentences being queued up. We have run Joshua + with as many as 48 threads without any problems of this kind, but it's useful to keep in the back + of your mind. + +- `oov-feature-cost` --- *100* + + Each OOV word incurs this cost, which is multiplied against the `oovPenalty` feature (which is + tuned but can be held fixed). + +- `use-google-linear-corpus-gain` +- `google-bleu-weights` + + +<a name="pruning" /> + +### Pruning options + +There are three different approaches to pruning in Joshua. + +1. No pruning. Exhaustive decoding is triggered by setting `pop-limit = 0` and +`use-beam-and-threshold-prune = false`. + +1. The old approach. This approach uses a handful of pruning parameters whose specific roles are +hard to understand and whose interaction is even more difficult to quantify. 
It is triggered by +setting `pop-limit = 0` and `use-beam-and-threshold-prune = true`. + +1. Pop-limit pruning (the new approach). The pop limit determines the number of hypotheses that are + popped from the candidates list for each of the O(n^2) spans of the input. A nice feature of this + approach is that it provides a single value to control the size of the search space that is + explored (and therefore runtime). + +Selecting among these pruning methods could be made easier via a single parameter with enumerated +values, but currently, we are stuck with this slightly more cumbersome way. The defaults ensure +that you don't have to worry about them too much. Pop-limit pruning is enabled by default, and it +is the recommended approach; if you want to control the speed / accuracy tradeoff, you should change +the pop limit. + +- `pop-limit` --- *100* + + The number of hypotheses to examine for each span of the input. Higher values result in a larger + portion of the search space being explored at the cost of an increased search time. + +- `use-beam-and-threshold-pruning` --- *false* + + Enables the use of beam-and-threshold pruning, and makes the following five features relevant. + + - `fuzz1` --- *0.1* + - `fuzz2` --- *0.2* + - `max-n-items` --- *30* + - `relative-threshold` --- *10.0* + - `max-n-rules` --- *50* + +- `constrain-parse` --- *false* +- `use_pos_labels` --- *false* + + +<a name="tm" /> + +### Translation model options + +At the moment, Joshua supports only two translation models, which are designated as the (main) +translation model and the glue grammar. Internally, these grammars are distinguished only in that +the `span-limit` parameter applies only to the glue grammar. In the near future we plan to +generalize the grammar specification to permit an unlimited number of translation models. + +The main translation grammar is specified with the following set of parameters: + +- `tm_file STRING` --- *NULL*, `glue_file STRING` --- *NULL* + + This points to the file location of the translation grammar for text-based formats or to the + directory for the [packed representation](packing.html). + +- `tm_format STRING` --- *thrax*, `glue_format STRING` --- *thrax* + + The format the file is in. The permissible formats are `hiero` or `thrax` (which are equivalent), + `packed` (for [packed grammars](packing.html)), or `samt` (for grammars encoded in the format + defined by [Zollmann & Venugopal](http://www.cs.cmu.edu/~zollmann/samt/). This parameter will be + done away with in the near future since it is easily inferrable. See + [the formats page](file-formats.html) for more information about file formats. + +- `phrase_owner STRING` --- *pt*, `glue-owner STRING` --- *pt* + + The ownership concept is used to distinguish the set of feature weights that apply to each + grammar. See the [page on features](features.html) for more information. By default, these + parameters have the same value, meaning the grammars share a set of features. + +- `span-limit` --- *10* + + This controls the maximum span of the input that grammar rules loaded from `tm-file` are allowed + to apply. The span limit is ignored for glue grammars. + +<a name="lm" /> + +### Language model options + +Joshua supports the incorporation of an arbitrary number of language models. 
To add a language +model, add a line of the following format to the configuration file: + + lm = lm-type order 0 0 lm-ceiling-cost lm-file + +where the six fields correspond to the following values: + +* *lm-type*: one of "kenlm", "berkeleylm", "javalm" (not recommended), or "none" +* *order*: the N of the N-gram language model +* *0*: whether to use left equivalent state (currently not supported) +* *0*: whether to use right equivalent state (currently not supported) +* *lm-ceiling-cost*: the LM-specific ceiling cost of any n-gram (currently ignored; + `lm-ceiling-cost` applies to all language models) +* *lm-file*: the path to the language model file. All types support the standard ARPA format. + Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's compiled format + (using the program at `$JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary`), and if the LM type + is "berkeleylm", it can be compiled by following the directions in + `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. + +For each language model, you need to specify a feature weight in the following format: + + lm 0 WEIGHT + lm 1 WEIGHT + ... + +where the indices correspond to the language model declaration lines in order. + +For backwards compatibility, Joshua also supports a separate means of specifying the language model, +by separately specifying each of `lm-file` (NULL), `lm-type` (kenlm), `order` (5), and +`lm-ceiling-cost` (100). + + +<a name="output" /> + +### Output options + +The output for a given input is a set of one or more lines with the following scheme: + + input ID ||| translation ||| model scores ||| score + +These parameters largely determine what is output by Joshua. + +- `top-n` --- *300* + + The number of translation hypotheses to output, sorted in non-increasing order of model score (i.e., + highest first). + +- `use-unique-nbest` --- *true* + + When constructing the n-best list for a sentence, skip hypotheses whose string has already been + output. This increases the amount of diversity in the n-best list by removing spurious ambiguity + in the derivation structures. + +- `add-combined-cost` --- *true* + + In addition to outputting the hypothesis number, the translation, and the individual feature + weights, output the combined model cost. + +- `use-tree-nbest` --- *false* + + Output the synchronous derivation tree in addition to the output string, for each candidate in the + n-best list. + +- `escape-trees` --- *false* + + +- `include-align-index` --- *false* + + Output the source words indices that each target word aligns to. + +- `mark-oovs` --- *false* + + if `true`, this causes the text "_OOV" to be appended to each OOV in the output. + +- `visualize-hypergraph` --- *false* + + If set to true, a visualization of the hypergraph will be displayed, though you will have to + explicitly include the relevant jar files. See the example usage in + `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence, + translation, and synchronous derivation. + +- `save-disk-hg` --- *false* [DISABLED] + + This feature directs that the hypergraph should be written to disk. The code is in + + $JOSHUA/src/joshua/src/DecoderThread.java + + but the feature has not been tested in some time, and is thus disabled. It probably wouldn't take + much work to fix it! If you do, you might find the + [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format) + on the ACL Wiki to be useful. 
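
To tie the parameters above together, here is a minimal sketch of what a Joshua configuration file might look like; the file paths, the LM order, and the weight values are hypothetical placeholders rather than values shipped with Joshua:

    # decoder options (parameter = value)
    lm = kenlm 5 0 0 100 /path/to/lm.kenlm
    tm_file = /path/to/grammar.gz
    tm_format = thrax
    glue_file = /path/to/glue-grammar.gz
    glue_format = thrax
    top-n = 300
    pop-limit = 100

    # feature weights (note: no equals sign)
    lm 0 4.23
    oovpenalty -100

Any of the runtime parameters (but not the weights) can also be overridden from the command line, e.g., `$JOSHUA/joshua-decoder -c joshua.config -top-n 10`.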
+ +<!-- + +## Full list of command-line options and arguments + +<table border="0"> + <tr> + <th> + option + </th> + <th> + value + </th> + <th> + description + </th> + </tr> + + <tr> + <td> + <code>-lm</code> + </td> + <td> + String, e.g. <n /> <code>TYPE 5 false false 100 FILE</code> + </td> + <td markdown="1"> + Use once for each of one or language models. + </td> + </tr> + + <tr> + <td> + <code>-lm_file</code> + </td> + <td> + String: path the the language model file + </td> + <td> + ??? + </td> + </tr> + + <tr> + <td> + <code>-parse</code> + </td> + <td> + None + </td> + <td> + whether to parse (if not then decode) + </td> + </tr> + + <tr> + <td> + <code>-tm_file</code> + </td> + <td> + String + </td> + <td> + path to the the translation model + </td> + </tr> + + <tr> + <td> + <code>-glue_file</code> + </td> + <td> + String + </td> + <td> + ??? + </td> + </tr> + + <tr> + <td> + <code>-tm_format</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>-glue_format</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>-lm_type</code> + </td> + <td> + value + </td> + <td> + description + </td> + </tr> + <tr> + <td> + <code>lm_ceiling_cost</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_left_equivalent_state</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_right_equivalent_state</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>order</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_sent_specific_lm</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>span_limit</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>phrase_owner</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>glue_owner</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>default_non_terminal</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>goalSymbol</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>constrain_parse</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>oov_feature_index</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>true_oovs_only</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_pos_labels</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>fuzz1</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>fuzz2</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>max_n_items</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>relative_threshold</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>max_n_rules</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_unique_nbest</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + 
<code>add_combined_cost</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_tree_nbest</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>escape_trees</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>include_align_index</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>top_n</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>parallel_files_prefix</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>num_parallel_decoders</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>threads</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>save_disk_hg</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>use_kbest_hg</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>forest_pruning</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>forest_pruning_threshold</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>visualize_hypergraph</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>mark_oovs</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>pop-limit</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> + + <tr> + <td> + <code>useCubePrune</code> + </td> + <td> + String + </td> + <td> + description + </td> + </tr> +</table> +--> + http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/faq.md ---------------------------------------------------------------------- diff --git a/4.0/faq.md b/4.0/faq.md new file mode 100644 index 0000000..f0a4151 --- /dev/null +++ b/4.0/faq.md @@ -0,0 +1,7 @@ +--- +layout: default4 +category: help +title: Common problems +--- + +Solutions to common problems will be posted here as we become aware of them. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/features.md ---------------------------------------------------------------------- diff --git a/4.0/features.md b/4.0/features.md new file mode 100644 index 0000000..d915c82 --- /dev/null +++ b/4.0/features.md @@ -0,0 +1,7 @@ +--- +layout: default4 +category: links +title: Features +--- + +This file will contain information about the Joshua decoder features. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/file-formats.md ---------------------------------------------------------------------- diff --git a/4.0/file-formats.md b/4.0/file-formats.md new file mode 100644 index 0000000..c10f906 --- /dev/null +++ b/4.0/file-formats.md @@ -0,0 +1,78 @@ +--- +layout: default4 +category: advanced +title: Joshua file formats +--- +This page describes the formats of Joshua configuration and support files. + +## Translation models (grammars) + +Joshua supports three grammar file formats. + +1. Thrax / Hiero +1. SAMT [deprecated] +1. packed + +The *Hiero* format is not restricted to Hiero grammars, but simply means *the format that David +Chiang developed for Hiero*. It can support a much broader class of SCFGs containing an arbitrary +set of nonterminals. 
Similarly, the *SAMT* format is not restricted to SAMT grammars but instead +simply denotes *the grammar format that Zollmann and Venugopal developed for their decoder*. To +remove this source of confusion, "thrax" is the preferred format designation, and is in fact the +default. + +The packed grammar format is the efficient grammar representation developed by +[Juri Ganitkevich](http://cs.jhu.edu/~juri) [is described in detail elsewhere](packing.html). + +Grammar rules in the Thrax format follow this format: + + [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES + +Here are some two examples, one for a Hiero grammar, and the other for an SAMT grammar: + + [X] ||| el chico [X] ||| the boy [X] ||| -3.14 0 2 17 + [S] ||| el chico [VP] ||| the boy [VP] ||| -3.14 0 2 17 + +The feature values can have optional labels, e.g.: + + [X] ||| el chico [X] ||| the boy [X] ||| lexprob=-3.14 abstract=0 numwords=2 count=17 + +These feature names are made up. For an actual list of feature names, please +[see the Thrax documentation](thrax.html). + +The SAMT grammar format is deprecated and undocumented. + +## Language Model + +Joshua has three language model implementations: [KenLM](), [BerkeleyLM](), and an (unrecommended) +dummy Java implementation. All language model implementations support the standard ARPA format +output by [SRILM](). In addition, KenLM and BerkeleyLM support compiled formats that can be loaded +more quickly and efficiently. + +### Compiling for KenLM + +To compile an ARPA grammar for KenLM, use the (provided) `build-binary` command, located deep within +the Joshua source code: + + $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary lm.arpa lm.kenlm + +This script takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`. + +### Compiling for BerkeleyLM + +To compile a grammar for BerkeleyLM, type: + + java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm + +The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html). + +## Joshua configuration + +See [the decoder page](decoder.html). + +## Pipeline configuration + +See [the pipeline page](pipeline.html). + +## Thrax configuration + +See [the thrax page](thrax.html). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/index.md ---------------------------------------------------------------------- diff --git a/4.0/index.md b/4.0/index.md new file mode 100644 index 0000000..ae62e4e --- /dev/null +++ b/4.0/index.md @@ -0,0 +1,48 @@ +--- +layout: default4 +title: Joshua 4.0 User Documentation +--- + +This page contains end-user oriented documentation for the 4.0 release of +[the Joshua decoder](http://joshua-decoder.org/). + +## Download and Setup + +1. Follow [this link](http://cs.jhu.edu/~post/files/joshua-4.0.tgz) to download Joshua, or do it +from the command line: + + wget -q http://cs.jhu.edu/~post/files/joshua-4.0.tgz + +2. Next, unpack it, set the `$JOSHUA` environment variable, and compile everything: + + tar xzf joshua-4.0.tgz + cd joshua-4.0 + + # for bash + export JOSHUA=$(pwd) + echo "export JOSHUA=$JOSHUA" >> ~/.bashrc + + # for tcsh + setenv JOSHUA `pwd` + echo "setenv JOSHUA $JOSHUA" >> ~/.profile + + ant all + +3. That's it. 
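
As a quick sanity check that the build succeeded (an optional step, not part of the original instructions), you can pipe a line of text through the decoder with no configuration at all; as noted on [the decoder page](decoder.html), input is simply pushed through untranslated when no translation model is specified:

    echo "hello" | $JOSHUA/joshua-decoder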
+
+## Quick start
+
+If you just want to run the complete machine translation pipeline (beginning with data preparation,
+through alignment, hierarchical model building, tuning, testing, and reporting), we recommend you
+use our <a href="pipeline.html">pipeline script</a>. You might also be interested in
+[Chris' old walkthrough](http://cs.jhu.edu/~ccb/joshua/).
+
+## More information
+
+For more detail on the decoder itself, including its command-line options, see
+[the Joshua decoder page](decoder.html). You can also learn more about other steps of
+[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
+Joshua's [efficient grammar representation](packing.html) (new with version 4.0).
+
+If you have problems or issues, you might find some help [on our answers page](faq.html) or
+[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/large-lms.md
----------------------------------------------------------------------
diff --git a/4.0/large-lms.md b/4.0/large-lms.md
new file mode 100644
index 0000000..a4ba5b7
--- /dev/null
+++ b/4.0/large-lms.md
@@ -0,0 +1,192 @@
+---
+layout: default4
+title: Building large LMs with SRILM
+category: advanced
+---
+
+The following is a tutorial for building a large language model from the
+English Gigaword Fifth Edition corpus
+[LDC2011T07](http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07)
+using SRILM. English text is provided from seven different sources.
+
+### Step 0: Clean up the corpus
+
+The Gigaword corpus has to be stripped of all SGML tags and tokenized.
+Instructions for performing those steps are not included in this
+documentation. A description of this process can be found in a paper
+called ["Annotated
+Gigaword"](https://akbcwekex2012.files.wordpress.com/2012/05/28_paper.pdf).
+
+The Joshua package ships with a script that converts all alphabetical
+characters to their lowercase equivalent. The script is located at
+`$JOSHUA/scripts/lowercase.perl`.
+
+Make a directory structure as follows:
+
+    gigaword/
+    ├── corpus/
+    │   ├── afp_eng/
+    │   │   ├── afp_eng_199405.lc.gz
+    │   │   ├── afp_eng_199406.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── apw_eng/
+    │   │   ├── apw_eng_199411.lc.gz
+    │   │   ├── apw_eng_199412.lc.gz
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── cna_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── ltw_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── nyt_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   ├── wpb_eng/
+    │   │   ├── ...
+    │   │   └── counts/
+    │   └── xin_eng/
+    │       ├── ...
+    │       └── counts/
+    └── lm/
+        ├── afp_eng/
+        ├── apw_eng/
+        ├── cna_eng/
+        ├── ltw_eng/
+        ├── nyt_eng/
+        ├── wpb_eng/
+        └── xin_eng/
+
+The next step will be to build smaller LMs and then interpolate them into one
+file.
+
+### Step 1: Count ngrams
+
+Run the following script once from each source directory under the `corpus/`
+directory (edit it to specify the path to the `ngram-count` binary as well as
+the number of processors):
+
+    #!/bin/sh
+
+    NGRAM_COUNT=$SRILM_SRC/bin/i686-m64/ngram-count
+    args=""
+
+    for source in *.gz; do
+       args=$args"-sort -order 5 -text $source -write counts/$source-counts.gz "
+    done
+
+    echo $args | xargs --max-procs=4 -n 7 $NGRAM_COUNT
+
+Then move each `counts/` directory to the corresponding directory under
+`lm/`. Now that each ngram has been counted, we can make a language
+model for each of the seven sources.
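
One way to carry out the move just described, assuming the directory layout shown in Step 0 and starting from the top-level `gigaword/` directory, is a short loop like the following (a sketch, not part of the original instructions):

    # move each source's counts/ directory from corpus/ to lm/
    for src in afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng; do
        mv corpus/$src/counts lm/$src/
    done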
+
+### Step 2: Make individual language models
+
+SRILM includes a script, called `make-big-lm`, for building large language
+models under resource-limited environments. The manual for this script can be
+read online
+[here](http://www-speech.sri.com/projects/srilm/manpages/training-scripts.1.html).
+Since the Gigaword corpus is so large, it is convenient to use `make-big-lm`
+even in environments with many parallel processors and a lot of memory.
+
+Initiate the following script from each of the source directories under the
+`lm/` directory (edit it to specify the path to the `make-big-lm` script as
+well as the pruning threshold):
+
+    #!/bin/bash
+    set -x
+
+    CMD=$SRILM_SRC/bin/make-big-lm
+    PRUNE_THRESHOLD=1e-8
+
+    $CMD \
+      -name gigalm `for k in counts/*.gz; do echo " \
+      -read $k "; done` \
+      -lm lm.gz \
+      -max-per-file 100000000 \
+      -order 5 \
+      -kndiscount \
+      -interpolate \
+      -unk \
+      -prune $PRUNE_THRESHOLD
+
+The language model attributes chosen are the following:
+
+* N-grams up to order 5
+* Kneser-Ney smoothing
+* N-gram probability estimates at the specified order *n* are interpolated with
+  lower-order estimates
+* include the unknown-word token as a regular word
+* pruning N-grams based on the specified threshold
+
+Next, we will mix the models together into a single file.
+
+### Step 3: Mix models together
+
+Using development text, interpolation weights can be determined that give the
+highest weight to the source language models that have the lowest perplexity on
+the specified development set.
+
+#### Step 3-1: Determine interpolation weights
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the path to the development text file):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DEV_TEXT=~mpost/expts/wmt12/runs/es-en/data/tune/tune.tok.lc.es
+
+    dirs=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+
+    for d in ${dirs[@]} ; do
+      $NGRAM -debug 2 -order 5 -unk -lm $d/lm.gz -ppl $DEV_TEXT > $d/lm.ppl ;
+    done
+
+    compute-best-mix */lm.ppl > best-mix.ppl
+
+Take a look at the contents of `best-mix.ppl`. It will contain a sequence of
+values in parentheses. These are the interpolation weights of the source
+language models in the order specified. Copy and paste the values within the
+parentheses into the script below.
+
+#### Step 3-2: Combine the models
+
+Initiate the following script from the `lm/` directory (edit it to specify the
+path to the `ngram` binary as well as the interpolation weights):
+
+    #!/bin/bash
+    set -x
+
+    NGRAM=$SRILM_SRC/bin/i686-m64/ngram
+    DIRS=( afp_eng apw_eng cna_eng ltw_eng nyt_eng wpb_eng xin_eng )
+    LAMBDAS=(0.00631272 0.000647602 0.251555 0.0134726 0.348953 0.371566 0.00749238)
+
+    $NGRAM -order 5 -unk \
+      -lm ${DIRS[0]}/lm.gz -lambda ${LAMBDAS[0]} \
+      -mix-lm ${DIRS[1]}/lm.gz \
+      -mix-lm2 ${DIRS[2]}/lm.gz -mix-lambda2 ${LAMBDAS[2]} \
+      -mix-lm3 ${DIRS[3]}/lm.gz -mix-lambda3 ${LAMBDAS[3]} \
+      -mix-lm4 ${DIRS[4]}/lm.gz -mix-lambda4 ${LAMBDAS[4]} \
+      -mix-lm5 ${DIRS[5]}/lm.gz -mix-lambda5 ${LAMBDAS[5]} \
+      -mix-lm6 ${DIRS[6]}/lm.gz -mix-lambda6 ${LAMBDAS[6]} \
+      -write-lm mixed_lm.gz
+
+The resulting file, `mixed_lm.gz`, is a language model based on all the text in
+the Gigaword corpus, with some probabilities biased toward the development text
+specified in Step 3-1. It is in the ARPA format. The optional next step converts
+it into KenLM format.
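
As an optional sanity check (not part of the original tutorial), you can compute the mixed model's perplexity on the development text, reusing the `ngram` binary and the `DEV_TEXT` path from the Step 3-1 script and assuming you are still in the `lm/` directory:

    # optional: perplexity of the mixed model on the Step 3-1 development text
    $SRILM_SRC/bin/i686-m64/ngram -order 5 -unk -lm mixed_lm.gz -ppl $DEV_TEXT

Since the mixture weights were optimized on this text, the reported perplexity should typically be lower than that of any individual source model.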
+
+#### Step 3-3: Convert to KenLM
+
+The KenLM format has some speed advantages over the ARPA format. Issuing the
+following command will write a new language model file `mixed_lm.kenlm` that
+is the `mixed_lm.gz` language model transformed into the KenLM format.
+
+    $JOSHUA/src/joshua/decoder/ff/lm/kenlm/build_binary mixed_lm.gz mixed_lm.kenlm
+

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/lattice.md
----------------------------------------------------------------------
diff --git a/4.0/lattice.md b/4.0/lattice.md
new file mode 100644
index 0000000..5d6bd47
--- /dev/null
+++ b/4.0/lattice.md
@@ -0,0 +1,17 @@
+---
+layout: default4
+category: advanced
+title: Lattice decoding
+---
+
+In addition to regular sentences, Joshua can decode weighted lattices encoded in [the PLF
+format](http://www.statmt.org/moses/?n=Moses.WordLattices). Lattice decoding was originally added
+by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/).
+
+Joshua will automatically detect whether the input sentence is a regular sentence
+(the usual case) or a lattice. If it is a lattice, a feature will be activated that accumulates the
+cost of different paths through the lattice. In this case, you need to ensure that a weight for this
+feature is present in [your model file](decoder.html).
+
+The main caveat with Joshua's PLF lattice support is that the lattice needs to be listed on a
+single line.

http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/packing.md
----------------------------------------------------------------------
diff --git a/4.0/packing.md b/4.0/packing.md
new file mode 100644
index 0000000..9318f6e
--- /dev/null
+++ b/4.0/packing.md
@@ -0,0 +1,76 @@
+---
+layout: default4
+category: advanced
+title: Grammar Packing
+---
+
+Grammar packing refers to the process of taking a textual grammar output by [Thrax](thrax.html) and
+efficiently encoding it for use by Joshua. Packing the grammar results in significantly faster load
+times for very large grammars.
+
+Soon, the [Joshua pipeline script](pipeline.html) will add support for grammar packing
+automatically, and we will provide a script that automates these steps for you.
+
+1. Make sure the grammar is labeled. A labeled grammar is one that has feature names attached to
+each of the feature values in each row of the grammar file. Here is a line from an unlabeled
+grammar:
+
+        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| 0 0 1 0 0 1.02184
+
+   and here is one from a labeled grammar (note that the labels are not very useful):
+
+        [X] ||| [X,1] অন্যান্য [X,2] ||| [X,1] other [X,2] ||| f1=0 f2=0 f3=1 f4=0 f5=0 f6=1.02184
+
+   If your grammar is not labeled, you can use the script `$JOSHUA/scripts/label_grammar.py`:
+
+        zcat grammar.gz | $JOSHUA/scripts/label_grammar.py > grammar-labeled.gz
+
+   A side effect of this step is to produce a file 'dense_map' in the current directory,
+   containing the mapping between feature names and feature columns. This file is needed in later
+   steps.
+
+1. The packer needs a sorted grammar. It is sufficient to sort by the first word:
+
+        zcat grammar-labeled.gz | sort -k3,3 | gzip > grammar-sorted.gz
+
+   (The reason we need a sorted grammar is that the packer stores the grammar in a trie. The
+   pieces can't be more than 2 GB due to Java limitations, so we need to ensure that rules are
+   grouped by the first arc in the trie to avoid redundancy across tries and to simplify the
+   lookup).
+
+1. In order to pack the grammar, we need two pieces of information: (1) a packer configuration file,
+   and (2) a dense map file.
+ + 1. Write a packer config file. This file specifies items such as the chunk size (for the packed + pieces) and the quantization classes and types for each feature name. Examples can be found + at + + $JOSHUA/test/packed/packer.config + $JOSHUA/test/bn-en/packed/packer.quantized + $JOSHUA/test/bn-en/packed/packer.uncompressed + + The quantizer lines in the packer config file have the following format: + + quantizer TYPE FEATURES + + where `TYPE` is one of `boolean`, `float`, `byte`, or `8bit`, and `FEATURES` is a + space-delimited list of feature names that have that quantization type. + + 1. Write a dense_map file. If you labeled an unlabeled grammar, this was produced for you as a + side product of the `label_grammar.py` script you called in Step 1. Otherwise, you need to + create a file that lists the mapping between feature names and (0-indexed) columns in the + grammar, one per line, in the following format: + + feature-index feature-name + +1. To pack the grammar, type the following command: + + java -cp $JOSHUA/bin joshua.tools.GrammarPacker -c PACKER_CONFIG_FILE -p OUTPUT_DIR -g GRAMMAR_FILE + + This will read in your packer configuration file and your grammar, and produced a packed grammar + in the output directory. + +1. To use the packed grammar, just point to the packed directory in your Joshua configuration file. + + tm-file = packed-grammar/ + tm-format = packed http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/ccc92816/4.0/pipeline.md ---------------------------------------------------------------------- diff --git a/4.0/pipeline.md b/4.0/pipeline.md new file mode 100644 index 0000000..33eafb3 --- /dev/null +++ b/4.0/pipeline.md @@ -0,0 +1,576 @@ +--- +layout: default4 +category: links +title: The Joshua Pipeline +--- + +This page describes the Joshua pipeline script, which manages the complexity of training and +evaluating machine translation systems. The pipeline eases the pain of two related tasks in +statistical machine translation (SMT) research: + +1. Training SMT systems involves a complicated process of interacting steps that are time-consuming +and prone to failure. + +1. Developing and testing new techniques requires varying parameters at different points in the +pipeline. Earlier results (which are often expensive) need not be recomputed. + +To facilitate these tasks, the pipeline script: +- Runs the complete SMT pipeline, from corpus normalization and tokenization, through model + building, tuning, test-set decoding, and evaluation. + +- Caches the results of intermediate steps (using robust SHA-1 checksums on dependencies), so the + pipeline can be debugged or shared across similar runs with (almost) no time spent recomputing + expensive steps. + +- Allows you to jump into and out of the pipeline at a set of predefined places (e.g., the alignment + stage), so long as you provide the missing dependencies. + +The Joshua pipeline script is designed in the spirit of Moses' `train-model.pl`, and shares many of +its features. It is not as extensive, however, as Moses' +[Experiment Management System](http://www.statmt.org/moses/?n=FactoredTraining.EMS). + +## Installation + +The pipeline has no *required* external dependencies. However, it has support for a number of +external packages, some of which are included with Joshua. + +- [GIZA++](http://code.google.com/p/giza-pp/) + + GIZA++ is the default aligner. It is included with Joshua, and should compile successfully when + you typed `ant all` from the Joshua root directory. 
It is not required because you can use the + (included) Berkeley aligner (`--aligner berkeley`). + +- [SRILM](http://www.speech.sri.com/projects/srilm/) + + By default, the pipeline uses a Java program from the + [Berkeley LM](http://code.google.com/p/berkeleylm/) package that constructs an + Kneser-Ney-smoothed language model in ARPA format from the target side of your training data. If + you wish to use SRILM instead, you need to do the following: + + 1. Install SRILM and set the `$SRILM` environment variable to point to its installed location. + 1. Add the `--lm-gen srilm` flag to your pipeline invocation. + + More information on this is available in the [LM building section of the pipeline](#lm). SRILM + is not used for representing language models during decoding (and in fact is not supported, + having been supplanted by [KenLM](http://kheafield.com/code/kenlm/) and BerkeleyLM). + +- [Hadoop](http://hadoop.apache.org/) + + The pipeline uses the [Thrax grammar extractor](thrax.html), which is built on Hadoop. If you + have a Hadoop installation, simply ensure that the `$HADOOP` environment variable is defined, and + the pipeline will use it automatically at the grammar extraction step. If you are going to + attempt to extract very large grammars, it is best to have a good-sized Hadoop installation. + + (If you do not have a Hadoop installation, you might consider setting one up. Hadoop can be + installed in a + ["pseudo-distributed"](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed) + mode that allows it to use just a few machines or a number of processors on a single machine. + The main issue is to ensure that there are a lot of independent physical disks, since in our + experience Hadoop starts to exhibit lots of hard-to-trace problems if there is too much demand on + the disks.) + + If you don't have a Hadoop installation, there are still no worries. The pipeline will unroll a + standalone installation and use it to extract your grammar. This behavior will be triggered if + `$HADOOP` is undefined. + +Make sure that the environment variable `$JOSHUA` is defined, and you should be all set. + +## A basic pipeline run + +The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of +intermediate files in the *run directory*. By default, the run directory is the current directory, +but it can be changed with the `--rundir` parameter. + +For this quick start, we will be working with the example that can be found in +`$JOSHUA/examples/pipeline`. This example contains 1,000 sentences of Urdu-English data (the full +dataset is available as part of the +[Indian languages parallel corpora](http://joshua-decoder.org/indian-parallel-corpora/) with +100-sentence tuning and test sets with four references each. + +Running the pipeline requires two main steps: data preparation and invocation. + +1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning, + and test data. A good convention is to place these files in an input/ subdirectory of your run's + working directory (NOTE: do not use `data/`, since a directory of that name is created and used + by the pipeline itself). 
The expected format (for each of training, tuning, and test) is a pair + of files that share a common path prefix and are distinguished by their extension: + + input/ + train.SOURCE + train.TARGET + tune.SOURCE + tune.TARGET + test.SOURCE + test.TARGET + + These files should be parallel at the sentence level (with one sentence per line), should be in + UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote + variables that should be replaced with the actual target and source language abbreviations (e.g., + "ur" and "en"). + +1. Run the pipeline. The following is the minimal invocation to run the complete pipeline: + + $JOSHUA/scripts/training/pipeline.pl \ + --corpus input/train \ + --tune input/tune \ + --test input/devtest \ + --source SOURCE \ + --target TARGET + + The `--corpus`, `--tune`, and `--test` flags define file prefixes that are concatened with the + language extensions given by `--target` and `--source` (with a "." in betwee). Note the + correspondences with the files defined in the first step above. The prefixes can be either + absolute or relative pathnames. This particular invocation assumes that a subdirectory `input/` + exists in the current directory, that you are translating from a language identified "ur" + extension to a language identified by the "en" extension, that the training data can be found at + `input/train.en` and `input/train.ur`, and so on. + +Assuming no problems arise, this command will run the complete pipeline in about 20 minutes, +producing BLEU scores at the end. As it runs, you will see output that looks like the following: + + [train-copy-en] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.en + dep=data/train/train.en.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz + took 0 seconds (0s) + [train-copy-ur] rebuilding... + dep=/Users/post/code/joshua/test/pipeline/input/train.ur + dep=data/train/train.ur.gz [NOT FOUND] + cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz + took 0 seconds (0s) + ... + +And in the current directory, you will see the following files (among other intermediate files +generated by the individual sub-steps). + + data/ + train/ + corpus.ur + corpus.en + thrax-input-file + tune/ + tune.tok.lc.ur + tune.tok.lc.en + grammar.filtered.gz + grammar.glue + test/ + test.tok.lc.ur + test.tok.lc.en + grammar.filtered.gz + grammar.glue + alignments/ + 0/ + [berkeley aligner output files] + training.align + thrax-hiero.conf + thrax.log + grammar.gz + lm.gz + tune/ + 1/ + decoder_command + joshua.config + params.txt + joshua.log + mert.log + joshua.config.ZMERT.final + final-bleu + +These files will be described in more detail in subsequent sections of this tutorial. + +Another useful flag is the `--rundir DIR` flag, which chdir()s to the specified directory before +running the pipeline. By default the rundir is the current directory. Changing it can be useful +for organizing related pipeline runs. Relative paths specified to other flags (e.g., to `--corpus` +or `--lmfile`) are relative to the directory the pipeline was called *from*, not the rundir itself +(unless they happen to be the same, of course). + +The complete pipeline comprises many tens of small steps, which can be grouped together into a set +of traditional pipeline tasks: + +1. [Data preparation](#prep) +1. [Alignment](#alignment) +1. [Parsing](#parsing) +1. [Grammar extraction](#tm) +1. 
[Language model building](#lm) +1. [Tuning](#tuning) +1. [Testing](#testing) + +These steps are discussed below, after a few intervening sections about high-level details of the +pipeline. + +## Grammar options + +Joshua can extract two types of grammars: Hiero-style grammars and SAMT grammars. As described on +the [file formats page](file-formats.html), both of them are encoded into the same file format, but +they differ in terms of the richness of their nonterminal sets. + +Hiero grammars make use of a single nonterminals, and are extracted by computing phrases from +word-based alignments and then subtracting out phrase differences. More detail can be found in +[Chiang (2007) [PDF]](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.2.201). +[SAMT grammars](http://www.cs.cmu.edu/~zollmann/samt/) make use of a source- or target-side parse +tree on the training data, projecting constituent labels down on the phrasal alignments in a variety +of configurations. SAMT grammars are usually many times larger and are much slower to decode with, +but sometimes increase BLEU score. Both grammar formats are extracted with the +[Thrax software](thrax.html). + +By default, the Joshua pipeline extract a Hiero grammar, but this can be altered with the `--type +samt` flag. + +## Other high-level options + +The following command-line arguments control run-time behavior of multiple steps: + +- `--threads N` (1) + + This enables multithreaded operation for a number of steps: alignment (with GIZA, max two + threads), parsing, and decoding (any number of threads) + +- `--jobs N` (1) + + This enables parallel operation over a cluster using the qsub command. This feature is not + well-documented at this point, but you will likely want to edit the file + `$JOSHUA/scripts/training/parallelize/LocalConfig.pm` to setup your qsub environment, and may also + want to pass specific qsub commands via the `--qsub-args "ARGS"` command. + +## Restarting failed runs + +If the pipeline dies, you can restart it with the same command you used the first time. If you +rerun the pipeline with the exact same invocation as the previous run (or an overlapping +configuration -- one that causes the same set of behaviors), you will see slightly different +output compared to what we saw above: + + [train-copy-en] cached, skipping... + [train-copy-ur] cached, skipping... + ... + +This indicates that the caching module has discovered that the step was already computed and thus +did not need to be rerun. This feature is quite useful for restarting pipeline runs that have +crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that +plague MT researchers across the world. + +Often, a command will die because it was parameterized incorrectly. For example, perhaps the +decoder ran out of memory. This allows you to adjust the parameter (e.g., `--joshua-mem`) and rerun +the script. Of course, if you change one of the parameters a step depends on, it will trigger a +rerun, which in turn might trigger further downstream reruns. + +## Skipping steps, quitting early + +You will also find it useful to start the pipeline somewhere other than data preparation (for +example, if you have already-processed data and an alignment, and want to begin with building a +grammar) or to end it prematurely (if, say, you don't have a test set and just want to tune a +model). 
This can be accomplished with the `--first-step` and `--last-step` flags, which take as +argument a case-insensitive version of the following steps: + +- *FIRST*: Data preparation. Everything begins with data preparation. This is the default first + step, so there is no need to be explicit about it. + +- *ALIGN*: Alignment. You might want to start here if you want to skip data preprocessing. + +- *PARSE*: Parsing. This is only relevant for building SAMT grammars (`--type samt`), in which case + the target side (`--target`) of the training data (`--corpus`) is parsed before building a + grammar. + +- *THRAX*: Grammar extraction [with Thrax](thrax.html). If you jump to this step, you'll need to + provide an aligned corpus (`--alignment`) along with your parallel data. + +- *TUNE*: Tuning. The exact tuning method is determined with `--tuner {mert,pro}`. With this + option, you need to specify a grammar (`--grammar`) or separate tune (`--tune-grammar`) and test + (`--test-grammar`) grammars. A full grammar (`--grammar`) will be filtered against the relevant + tuning or test set unless you specify `--no-filter-tm`. If you want a language model built from + the target side of your training data, you'll also need to pass in the training corpus + (`--corpus`). You can also specify an arbitrary number of additional language models with one or + more `--lmfile` flags. + +- *TEST*: Testing. If you have a tuned model file, you can test new corpora by passing in a test + corpus with references (`--test`). You'll need to provide a run name (`--name`) to store the + results of this run, which will be placed under `test/NAME`. You'll also need to provide a + Joshua configuration file (`--joshua-config`), one or more language models (`--lmfile`), and a + grammar (`--grammar`); this will be filtered to the test data unless you specify + `--no-filter-tm`) or unless you directly provide a filtered test grammar (`--test-grammar`). + +- *LAST*: The last step. This is the default target of `--last-step`. + +We now discuss these steps in more detail. + +<a name="prep" /> +## 1. DATA PREPARATION + +Data prepare involves doing the following to each of the training data (`--corpus`), tuning data +(`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path +prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and +TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET +files must have the same number of lines. + +For tuning and test data, multiple references are handled automatically. A single reference will +have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where +NUM starts at 0 and increments for as many references as there are. + +The following processing steps are applied to each file. + +1. **Copying** the files into `RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test". + Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune` + and `--test` flags are not currently allowed. + +1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special + quotations). There are a few language-specific options that depend on the file extension + matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) + designation. + +1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). 
We now discuss these steps in more detail.

<a name="prep" />
## 1. DATA PREPARATION

Data preparation involves doing the following to each of the training data (`--corpus`), tuning
data (`--tune`), and testing data (`--test`). Each of these values is an absolute or relative path
prefix. To each of these prefixes, a "." is appended, followed by each of SOURCE (`--source`) and
TARGET (`--target`), which are file extensions identifying the languages. The SOURCE and TARGET
files must have the same number of lines.

For tuning and test data, multiple references are handled automatically. A single reference will
have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where
NUM starts at 0 and increments for as many references as there are.

The following processing steps are applied to each file.

1. **Copying** the files into `RUNDIR/data/TYPE`, where TYPE is one of "train", "tune", or "test".
   Multiple `--corpora` files are concatenated in the order they are specified. Multiple `--tune`
   and `--test` flags are not currently allowed.

1. **Normalizing** punctuation and text (e.g., removing extra spaces, converting special
   quotations). There are a few language-specific options that depend on the file extension
   matching the [two-letter ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
   designation.

1. **Tokenizing** the data (e.g., separating out punctuation, converting brackets). Again, there
   are language-specific tokenizers for a few languages (English, German, and Greek).

1. (Training only) **Removing** all parallel sentences with more than `--maxlen` tokens on either
   side. By default, MAXLEN is 50. To turn this off, specify `--maxlen 0`.

1. **Lowercasing**.

This creates a series of intermediate files which are saved for posterity but compressed. For
example, you might see

    data/
        train/
            train.en.gz
            train.tok.en.gz
            train.tok.50.en.gz
            train.tok.50.lc.en
            corpus.en -> train.tok.50.lc.en

The file "corpus.LANG" is a symbolic link to the last file in the chain.

<a name="alignment" />
## 2. ALIGNMENT

Alignments are computed between the parallel corpora at
`RUNDIR/data/train/corpus.{SOURCE,TARGET}`. To prevent the alignment tables from getting too big,
the parallel corpora are grouped into files of no more than ALIGNER\_CHUNK\_SIZE blocks (controlled
with a parameter below). The last block is folded into the penultimate block if it is too small.
These chunked files are all created in a subdirectory of `RUNDIR/data/train/splits`, named
`corpus.LANG.0`, `corpus.LANG.1`, and so on.

The pipeline parameters affecting alignment are:

- `--aligner ALIGNER` {giza (default), berkeley}

  Which aligner to use. The default is [GIZA++](http://code.google.com/p/giza-pp/), but
  [the Berkeley aligner](http://code.google.com/p/berkeleyaligner/) can be used instead. When
  using the Berkeley aligner, you'll want to pay attention to how much memory you allocate to it
  with `--aligner-mem` (the default is 10g).

- `--aligner-chunk-size SIZE` (1,000,000)

  The number of sentence pairs to compute alignments over.

- `--alignment FILE`

  If you have an already-computed alignment, you can pass that to the script using this flag.
  Note that, in this case, you will want to skip data preparation and alignment using
  `--first-step thrax` (the first step after alignment) and also to specify `--no-prepare-data` so
  as not to retokenize the data and invalidate your alignments (see the sketch at the end of this
  section).

  The alignment file format is the standard format, where the 0-indexed many-many alignment pairs
  for a sentence are provided on a single line, source language first, e.g.,

      0-0 0-1 1-2 1-7 ...

  This value is required if you start at the grammar extraction step.

When alignment is complete, the alignment file can be found at `RUNDIR/alignments/training.align`.
It is parallel to the training corpora. There are many files in the `alignments/` subdirectory
that contain the output of intermediate steps.
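For example, a sketch of a run that reuses a previously computed alignment (the alignment path and
corpus prefixes are illustrative) looks like this:

    # alignment path and corpus prefixes are illustrative
    pipeline.pl --corpus input/train --tune input/tune --test input/devtest \
      --source ur --target en \
      --alignment alignments/training.align \
      --first-step thrax --no-prepare-data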
<a name="parsing" />
## 3. PARSING

When SAMT grammars are being built (`--type samt`), the target side of the training data must be
parsed. The pipeline assumes your target side will be English, and will parse it for you using
[the Berkeley parser](http://code.google.com/p/berkeleyparser/), which is included. If your
target-side language is not English, the target side of your training data (found at
CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that it is parsed
and will not reparse it.

Parsing is affected by both the `--threads N` and `--jobs N` options. The former runs the parser
in multithreaded mode, while the latter distributes the runs across a cluster (and requires some
configuration, not yet documented). The options are mutually exclusive.

Once the parsing is complete, there will be two parsed files:

- `RUNDIR/data/train/corpus.en.parsed`: this is the mixed-case file that was parsed.
- `RUNDIR/data/train/corpus.parsed.en`: this is a leaf-lowercased version of the above file used
  for grammar extraction.

<a name="tm" />
## 4. THRAX (grammar extraction)

The grammar extraction step takes three pieces of data: (1) the source-language training corpus,
(2) the target-language training corpus (parsed, if an SAMT grammar is being extracted), and (3)
the alignment file. From these, it computes a synchronous context-free grammar. If you already
have a grammar and wish to skip this step, you can do so by passing the grammar with the
`--grammar GRAMMAR` flag.

The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply
ensure that the environment variable `$HADOOP` is defined, and Thrax will seamlessly use it. If
you *do not* have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in
standalone mode (this mode is triggered when `$HADOOP` is undefined). Theoretically, any grammar
extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient
enough; in practice, you probably are not patient enough, and will be limited to smaller datasets.
Setting up your own Hadoop cluster is not too difficult a chore; in particular, you may find it
helpful to install a
[pseudo-distributed version of Hadoop](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
In our experience, this works fine, but you should note the following caveats:

- It is of crucial importance that you have enough physical disks. We have found that having too
  few disks, or disks that are too slow, results in a whole host of seemingly unrelated issues
  that are hard to resolve, such as timeouts.
- NFS filesystems can exacerbate this. You should really try to install physical disks that are
  dedicated to Hadoop scratch space.

Here are some flags relevant to Hadoop and grammar extraction with Thrax (see the sketch below for
an example):

- `--hadoop /path/to/hadoop`

  This sets the location of Hadoop (overriding the environment variable `$HADOOP`).

- `--hadoop-mem MEM` (2g)

  This alters the amount of memory available to Hadoop mappers (passed via the
  `mapred.child.java.opts` option).

- `--thrax-conf FILE`

  Use the provided Thrax configuration file instead of the (grammar-specific) default. The Thrax
  templates are located at `$JOSHUA/scripts/training/templates/thrax-TYPE.conf`, where TYPE is
  one of "hiero" or "samt".

When the grammar is extracted, it is compressed and placed at `RUNDIR/grammar.gz`.
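As an illustrative sketch (the Hadoop location is an assumption about your own installation, and
the corpus prefixes are again placeholders), a run that uses an existing Hadoop installation and
gives the mappers extra memory might be invoked as:

    # /opt/hadoop is a hypothetical install location; corpus prefixes are illustrative
    pipeline.pl --corpus input/train --tune input/tune --test input/devtest \
      --source ur --target en \
      --hadoop /opt/hadoop --hadoop-mem 4g

Leaving off `--hadoop` (and leaving `$HADOOP` unset) falls back to standalone mode, as described
above.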
<a name="lm" />
## 5. LANGUAGE MODEL

Before tuning can take place, a language model is needed. A language model is always built from
the target side of the training corpus unless `--no-corpus-lm` is specified. In addition, you can
provide other language models (any number of them) with the `--lmfile FILE` argument. Other
arguments are as follows.

- `--lm` {kenlm (default), berkeleylm}

  This determines the language model code that will be used when decoding. These implementations
  are described in their respective papers (PDFs:
  [KenLM](http://kheafield.com/professional/avenue/kenlm.pdf),
  [BerkeleyLM](http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf)).

- `--lmfile FILE`

  Specifies a pre-built language model to use when decoding. This language model can be in ARPA
  format, or in KenLM format (when using KenLM) or BerkeleyLM format (when using BerkeleyLM).

- `--lm-gen` {berkeleylm (default), srilm}, `--buildlm-mem MEM`, `--witten-bell`

  At the tuning step, an LM is built from the target side of the training data (unless
  `--no-corpus-lm` is specified). This option controls which code is used to build it. The default
  is a
  [BerkeleyLM java class](http://code.google.com/p/berkeleylm/source/browse/trunk/src/edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText.java)
  that computes a Kneser-Ney LM with constant discounting and no count thresholding. The flag
  `--buildlm-mem` can be used to control how much memory is allocated to the Java process. The
  default is "2g", but you will want to increase it for larger language models.

  If SRILM is used, it is called with the following arguments:

      $SRILM/bin/i686-m64/ngram-count -interpolate SMOOTHING -order 5 -text TRAINING-DATA -unk -lm lm.gz

  where SMOOTHING is `-kndiscount`, or `-wbdiscount` if `--witten-bell` is passed to the pipeline.

A language model built from the target side of the training data is placed at `RUNDIR/lm.gz`.


## Interlude: decoder arguments

Running the decoder is done in both the tuning stage and the testing stage. A critical point is
that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in
particular when decoding with large grammars and large language models. The default amount of
memory is 3100m, which is likely not enough (especially if you are decoding with an SAMT grammar).
You can alter the amount of memory for Joshua using the `--joshua-mem MEM` argument, where MEM is
a Java memory specification (passed to its `-Xmx` flag).

<a name="tuning" />
## 6. TUNING

Two optimizers are implemented for Joshua: MERT and PRO (`--tuner {mert,pro}`). Tuning is run
until convergence in the `RUNDIR/tune` directory. By default, tuning is run just once, but the
pipeline supports running the optimizer an arbitrary number of times, due to
[recent work](http://www.youtube.com/watch?v=BOa3XDkgf0Y) pointing out the variance of tuning
procedures in machine translation, in particular MERT. This can be activated with
`--optimizer-runs N`. Each run can be found in a directory `RUNDIR/tune/N`.

When tuning is finished, each final configuration file can be found at either

    RUNDIR/tune/N/joshua.config.ZMERT.final
    RUNDIR/tune/N/joshua.config.PRO.final

where N varies from 1..`--optimizer-runs`.

<a name="testing" />
## 7. TESTING

For each of the tuner runs, Joshua takes the tuner output file and decodes the test set.
Afterwards, by default, minimum Bayes-risk decoding is run on the 300-best output. This step
usually yields about 0.3 - 0.5 BLEU points, but is time-consuming, and can be turned off with the
`--no-mbr` flag.

After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score,
writes it to `RUNDIR/test/final-bleu`, and prints it to the screen. That's the end of the
pipeline!

Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a
number of arguments (a sketch follows this list):

- `--first-step TEST`

  This tells the decoder to start at the test step.

- `--name NAME`

  A name is needed to distinguish this test set from the previous ones. Output for this test run
  will be stored at `RUNDIR/test/NAME`.

- `--joshua-config CONFIG`

  A tuned parameter file is required. This file will be the output of some prior tuning run.
  Necessary pathnames and so on will be adjusted.
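For instance, a sketch of decoding an additional test set against the first tuning run (the run
name, test prefix, and file locations are illustrative; paths are relative to the run directory)
might be:

    # run name, test prefix, and file locations are illustrative
    pipeline.pl --first-step TEST --name newstest \
      --test input/newstest \
      --source ur --target en \
      --joshua-config tune/1/joshua.config.ZMERT.final \
      --grammar grammar.gz --lmfile lm.gz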
## COMMON USE CASES AND PITFALLS

- If the pipeline dies at the "thrax-run" stage with an error like the following:

      JOB FAILED (return code 1)
      hadoop/bin/hadoop: line 47:
      /some/path/to/a/directory/hadoop/bin/hadoop-config.sh: No such file or directory
      Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell

  This occurs if the `$HADOOP` environment variable is set but does not point to a working Hadoop
  installation. To fix it, make sure to unset the variable:

      # in bash
      unset HADOOP

  and then rerun the pipeline with the same invocation.

- Memory usage is a major consideration in decoding with Joshua and hierarchical grammars. In
  particular, SAMT grammars often require a large amount of memory. Many steps have been taken to
  reduce memory usage, including beam settings and test-set- and sentence-level filtering of
  grammars. However, memory usage can still be in the tens of gigabytes.

  To accommodate this kind of variation, the pipeline script allows you to specify both (a) the
  amount of memory used by the Joshua decoder instance and (b) the amount of memory required of
  nodes obtained by the qsub command. These are set with the `--joshua-mem MEM` and
  `--qsub-args ARGS` options. For example,

      pipeline.pl --joshua-mem 32g --qsub-args "-l pvmem=32g -q himem.q" ...

  Also, should Thrax fail, it might be due to a memory restriction. By default, Thrax requests 2
  GB from the Hadoop server. If more memory is needed, set the memory requirement with
  `--hadoop-mem`, in the same way the `--joshua-mem` option is used.

- Other pitfalls and advice will be added as they are discovered.

## FEEDBACK

Please email [email protected] with problems or suggestions.
