http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/decoder.html ---------------------------------------------------------------------- diff --git a/5.0/decoder.html b/5.0/decoder.html new file mode 100644 index 0000000..866f13d --- /dev/null +++ b/5.0/decoder.html @@ -0,0 +1,637 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Decoder configuration parameters</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" 
sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Decoder configuration parameters</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>Joshua configuration parameters affect the runtime behavior of the decoder itself. This page +describes the complete list of these parameters and describes how to invoke the decoder manually.</p> + +<p>To run the decoder, a convenience script is provided that loads the necessary Java libraries. 
+Assuming you have set the environment variable <code class="highlighter-rouge">$JOSHUA</code> to point to the root of your installation, +its syntax is:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...] +</code></pre> +</div> + +<p>The <code class="highlighter-rouge">-m</code> argument, if present, must come first, and the memory specification is in Java format +(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and +"gigabytes", and there cannot be a space between the number and the unit. The value of this +argument is passed to Java itself in the invocation of the decoder, and the remaining options are +passed to Joshua. The <code class="highlighter-rouge">-c</code> parameter has special import because it specifies the location of the +configuration file.</p> + +<p>The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are +received, according to a number of <a href="#output">output options</a>. If no run-time parameters are +specified (e.g., no translation model), sentences are simply pushed through untranslated. Blank +lines are similarly pushed through as blank lines, so as to maintain parallelism with the input.</p> + +<p>Parameters can be provided to Joshua via a configuration file and from the command +line. Command-line arguments override values found in the configuration file. 
The format for +configuration file parameters is</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>parameter = value +</code></pre> +</div> + +<p>Command-line options are specified in the following format</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>-parameter value +</code></pre> +</div> + +<p>Values are one of four types (which we list here mostly to call attention to the boolean format):</p> + +<ul> + <li>STRING, an arbitrary string (no spaces)</li> + <li>FLOAT, a floating-point value</li> + <li>INT, an integer</li> + <li> + <p>BOOLEAN, a boolean value. For booleans, <code class="highlighter-rouge">true</code> evaluates to true, and all other values evaluate +to false. For command-line options, the value may be omitted, in which case it evaluates to +true. For example, the following are equivalent:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/decoder -mark-oovs true +$JOSHUA/bin/decoder -mark-oovs +</code></pre> + </div> + </li> +</ul> + +<h2 id="joshua-configuration-file">Joshua configuration file</h2> + +<p>In addition to the decoder parameters described below, the configuration file contains the model +feature weights. These weights are distinguished from runtime parameters in that they are delimited +by a space instead of an equals sign. They take the following +format, and by convention are placed at the end of the configuration file:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>lm_0 4.23 +tm_pt_0 -0.2 +OOVPenalty -100 +</code></pre> +</div> + +<p>Joshua can make use of thousands of features, which are described in further detail in the +<a href="features.html">feature file</a>.</p> + +<h2 id="joshua-decoder-parameters">Joshua decoder parameters</h2> + +<p>This section contains a list of the Joshua run-time parameters. 
An important note about the +parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are +removed and case is converted to lowercase. For example, the following parameter forms are +equivalent (either in the configuration file or from the command line):</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="p">{</span><span class="err">top-n,</span><span class="w"> </span><span class="err">topN,</span><span class="w"> </span><span class="err">top_n,</span><span class="w"> </span><span class="err">TOP_N,</span><span class="w"> </span><span class="err">t-o-p-N</span><span class="p">}</span><span class="w"> +</span><span class="p">{</span><span class="err">poplimit,</span><span class="w"> </span><span class="err">pop-limit,</span><span class="w"> </span><span class="err">pop_limit,</span><span class="w"> </span><span class="err">popLimit,PoPlImIt</span><span class="p">}</span><span class="w"> +</span></code></pre> +</div> + +<p>This basically defines equivalence classes of parameters, and relieves you of the task of having to +remember the exact format of each parameter.</p> + +<p>In what follows, we group the configuration parameters in the following groups:</p> + +<ul> + <li><a href="#general">General options</a></li> + <li><a href="#pruning">Pruning</a></li> + <li><a href="#tm">Translation model options</a></li> + <li><a href="#lm">Language model options</a></li> + <li><a href="#output">Output options</a></li> + <li><a href="#modes">Alternate modes of operation</a></li> +</ul> + +<p><a id="general"></a></p> + +<h3 id="general-decoder-options">General decoder options</h3> + +<ul> + <li> + <p><code class="highlighter-rouge">c</code>, <code class="highlighter-rouge">config</code> &mdash; <em>NULL</em></p> + + <p>Specifies the configuration file from which Joshua options are loaded. 
This feature is unique in + that it must be specified from the command line (obviously).</p> + </li> + <li> + <p><code class="highlighter-rouge">amortize</code> &mdash; <em>true</em></p> + + <p>When true, specifies that sorting of the rule lists at each trie node in the grammar should be +delayed until the trie node is accessed. When false, all such nodes are sorted before decoding +even begins. Setting to true results in slower per-sentence decoding, but allows the decoder to +begin translating almost immediately (especially with large grammars).</p> + </li> + <li> + <p><code class="highlighter-rouge">server-port</code> &mdash; <em>0</em></p> + + <p>If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified +port. Clients can connect to it directly through programming APIs or command-line tools like +<code class="highlighter-rouge">telnet</code> or <code class="highlighter-rouge">nc</code>.</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>$ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723 +... +$ cat input.txt | nc localhost 8723 > results.txt +</code></pre> + </div> + </li> + <li> + <p><code class="highlighter-rouge">maxlen</code> &mdash; <em>200</em></p> + + <p>Input sentences longer than this are truncated.</p> + </li> + <li> + <p><code class="highlighter-rouge">feature-function</code></p> + + <p>Enables a particular feature function. See the <a href="features.html">feature function page</a> for more information.</p> + </li> + <li> + <p><code class="highlighter-rouge">oracle-file</code> &mdash; <em>NULL</em></p> + + <p>The location of a set of oracle reference translations, parallel to the input. When present, +after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the +translation forest with a BLEU approximation in order to extract the oracle-translation from the +forest. 
This is useful for obtaining an (approximation to an) upper bound on your translation +model under particular search settings.</p> + </li> + <li> + <p><code class="highlighter-rouge">default-nonterminal</code> &mdash; <em>"X"</em></p> + + <p>This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this + label to every word of the input, in fact, so that even known words can be translated as OOVs, if + the model prefers them. Usually, a very low weight on the <code class="highlighter-rouge">OOVPenalty</code> feature discourages their + use unless necessary.</p> + </li> + <li> + <p><code class="highlighter-rouge">goal-symbol</code> &mdash; <em>"GOAL"</em></p> + + <p>This is the symbol whose presence in the chart over the whole input span denotes a successful + parse (translation). It should match the LHS nonterminal in your glue grammar. Internally, + Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can + optionally supply in the configuration file.</p> + </li> + <li> + <p><code class="highlighter-rouge">true-oovs-only</code> &mdash; <em>false</em></p> + + <p>By default, Joshua creates an OOV entry for every word in the source sentence, regardless of +whether it is found in the grammar. This allows every word to be pushed through untranslated +(although potentially incurring a high cost based on the <code class="highlighter-rouge">OOVPenalty</code> feature). If this option is +set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua +examines the first level of the grammar trie for each word of the input (this isn't a perfect +heuristic, since a word could be present only in deeper levels of the trie).</p> + </li> + <li> + <p><code class="highlighter-rouge">threads</code>, <code class="highlighter-rouge">num-parallel-decoders</code> &mdash; <em>1</em></p> + + <p>This determines how many simultaneous decoding threads to launch. 
</p> + + <p>Outputs are assembled in order and Joshua has to hold on to the complete target hypergraph until +it is ready to be processed for output, so too many simultaneous threads could result in lots of +memory usage if a long sentence results in many sentences being queued up. We have run Joshua +with as many as 64 threads without any problems of this kind, but it's useful to keep in the back +of your mind.</p> + </li> + <li> + <p><code class="highlighter-rouge">weights-file</code> &mdash; NULL</p> + + <p>Weights are appended to the end of the Joshua configuration file, by convention. If you prefer to +put them in a separate file, you can do so, and point to the file with this parameter.</p> + </li> +</ul> + +<h3 id="pruning-options-a-idpruning-">Pruning options <a id="pruning"></a></h3> + +<ul> + <li> + <p><code class="highlighter-rouge">pop-limit</code> &mdash; <em>100</em></p> + + <p>The number of cube-pruning hypotheses that are popped from the candidates list for each span of +the input. Higher values result in a larger portion of the search space being explored at the +cost of an increased search time. For exhaustive search, set <code class="highlighter-rouge">pop-limit</code> to 0.</p> + </li> + <li> + <p><code class="highlighter-rouge">filter-grammar</code> &mdash; false</p> + + <p>Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is +filtered at runtime down to rules that can be applied to the sentence under consideration. 
This +takes some time (which we haven't thoroughly quantified), but can result in the removal of many +rules that are only partially applicable to the sentence.</p> + </li> + <li><code class="highlighter-rouge">constrain-parse</code> &mdash; <em>false</em></li> + <li> + <p><code class="highlighter-rouge">use_pos_labels</code> &mdash; <em>false</em></p> + + <p><em>These features are not documented.</em></p> + </li> +</ul> + +<h3 id="translation-model-options-a-idtm-">Translation model options <a id="tm"></a></h3> + +<p>Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar +containing translation rules, and the glue grammar for patching things together. Internally, Joshua +doesn't distinguish between the roles of these grammars; they are treated differently only in that +they typically have different span limits (the maximum input width they can be applied to).</p> + +<p>Grammars are instantiated with config file lines of the following form:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>tm = TYPE OWNER SPAN_LIMIT FILE +</code></pre> +</div> + +<ul> + <li><code class="highlighter-rouge">TYPE</code> is the grammar type, which must be set to "thrax". </li> + <li><code class="highlighter-rouge">OWNER</code> is the grammar's owner, which defines the set of <a href="features.html">feature weights</a> that +apply to the weights found in each line of the grammar (using different owners allows each grammar +to have different sets and numbers of weights, while sharing owners allows weights to be shared +across grammars).</li> + <li><code class="highlighter-rouge">SPAN_LIMIT</code> is the maximum span of the input that rules from this grammar can be applied to. A +span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must +be anchored to the left side of the sentence (index 0).</li> + <li><code class="highlighter-rouge">FILE</code> is the path to the file containing the grammar. 
If the file is a directory, it is assumed +to be <a href="packed.html">packed</a>. Only one packed grammar can currently be used at a time.</li> +</ul> + +<p>For reference, the following two translation model lines are used by the <a href="pipeline.html">pipeline</a>:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>tm = thrax pt 20 /path/to/packed/grammar +tm = thrax glue -1 /path/to/glue/grammar +</code></pre> +</div> + +<h3 id="language-model-options-a-idlm-">Language model options <a id="lm"></a></h3> + +<p>Joshua supports any number of language models. To add a language +model, add a line of the following format to the configuration file:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>lm = TYPE ORDER LEFT_STATE RIGHT_STATE CEILING_COST FILE +</code></pre> +</div> + +<p>where the six fields correspond to the following values:</p> + +<ul> + <li><code class="highlighter-rouge">TYPE</code>: one of "kenlm", "berkeleylm", or "none"</li> + <li><code class="highlighter-rouge">ORDER</code>: the order of the language model</li> + <li><code class="highlighter-rouge">LEFT_STATE</code>: whether to use left-state minimization; currently only supported by KenLM</li> + <li><code class="highlighter-rouge">RIGHT_STATE</code>: whether to use right equivalent state (currently unsupported)</li> + <li><code class="highlighter-rouge">CEILING_COST</code>: the LM-specific ceiling cost of all n-grams (currently ignored)</li> + <li><code class="highlighter-rouge">FILE</code>: the path to the language model file. All language model types support the standard ARPA + format. Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's binary + format (using the program at <code class="highlighter-rouge">$JOSHUA/bin/build_binary</code>); if the LM type is "berkeleylm", it + can be compiled by following the directions in + <code class="highlighter-rouge">$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README</code>. 
The <a href="pipeline.html">pipeline</a> will + automatically compile either type.</li> +</ul> + +<p>For each language model, you need to specify a feature weight in the following format:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>lm_0 WEIGHT +lm_1 WEIGHT +... +</code></pre> +</div> + +<p>where the indices correspond to the order of the language model declaration lines.</p> + +<h3 id="output-options-a-idoutput-">Output options <a id="output"></a></h3> + +<ul> + <li> + <p><code class="highlighter-rouge">output-format</code> <em>New in 5.0</em></p> + + <p>Joshua prints a lot of information to STDERR (making this more granular is on the TODO +list). Output to STDOUT is reserved for decoder translations, and is controlled by the <code class="highlighter-rouge">output-format</code> parameter, whose value may contain the following format specifiers:</p> + + <ul> + <li> + <p><code class="highlighter-rouge">%i</code>: the sentence number (0-indexed)</p> + </li> + <li> + <p><code class="highlighter-rouge">%e</code>: the source sentence</p> + </li> + <li> + <p><code class="highlighter-rouge">%s</code>: the translated sentence</p> + </li> + <li> + <p><code class="highlighter-rouge">%S</code>: the translated sentence, with some basic capitalization and denormalization, e.g.,</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>$ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null +¿Who you lookin' at, Mr.? 
+</code></pre> + </div> + </li> + <li> + <p><code class="highlighter-rouge">%t</code>: the synchronous derivation</p> + </li> + <li> + <p><code class="highlighter-rouge">%f</code>: the list of feature values (as name=value pairs)</p> + </li> + <li> + <p><code class="highlighter-rouge">%c</code>: the model cost</p> + </li> + <li> + <p><code class="highlighter-rouge">%w</code>: the weight vector (unimplemented)</p> + </li> + <li> + <p><code class="highlighter-rouge">%a</code>: the alignments between source and target words (unimplemented)</p> + </li> + </ul> + + <p>The default value is</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>output-format = %i ||| %s ||| %f ||| %c +</code></pre> + </div> + + <p>i.e.,</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>input ID ||| translation ||| model scores ||| score +</code></pre> + </div> + </li> + <li> + <p><code class="highlighter-rouge">top-n</code> &mdash; <em>300</em></p> + + <p>The number of translation hypotheses to output, sorted in decreasing order of model score.</p> + </li> + <li> + <p><code class="highlighter-rouge">use-unique-nbest</code> &mdash; <em>true</em></p> + + <p>When constructing the n-best list for a sentence, skip hypotheses whose string has already been +output.</p> + </li> + <li> + <p><code class="highlighter-rouge">escape-trees</code> &mdash; <em>false</em></p> + </li> + <li> + <p><code class="highlighter-rouge">include-align-index</code> &mdash; <em>false</em></p> + + <p>Output the source word indices that each target word aligns to.</p> + </li> + <li> + <p><code class="highlighter-rouge">mark-oovs</code> &mdash; <em>false</em></p> + + <p>If <code class="highlighter-rouge">true</code>, this causes the text "_OOV" to be appended to each untranslated word in the output.</p> + </li> + <li> + <p><code class="highlighter-rouge">visualize-hypergraph</code> &mdash; <em>false</em></p> + + <p>If set to true, a visualization of the hypergraph will be displayed, though you will have to +explicitly include 
the relevant jar files. See the example usage in +<code class="highlighter-rouge">$JOSHUA/examples/tree_visualizer/</code>, which contains a demonstration of a source sentence, +translation, and synchronous derivation.</p> + </li> + <li> + <p><code class="highlighter-rouge">dump-hypergraph</code> &mdash; ""</p> + + <p>This feature directs that the hypergraph should be written to disk for each input sentence. If +set, the value should contain the string "%d", which is replaced with the sentence number. For +example,</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt +</code></pre> + </div> + + <p>Note that the output directory must exist.</p> + + <p>TODO: revive the +<a href="http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format">discussion on a common hypergraph format</a> +on the ACL Wiki and support that format.</p> + </li> +</ul> + +<h3 id="lattice-decoding">Lattice decoding</h3> + +<p>In addition to regular sentences, Joshua can decode weighted lattices encoded in +<a href="http://www.statmt.org/moses/?n=Moses.WordLattices">the PLF format</a>, except that path costs should +be listed as <b>log probabilities</b> instead of probabilities. Lattice decoding was originally +added by Lane Schwartz and <a href="http://www.cs.cmu.edu/~cdyer/">Chris Dyer</a>.</p> + +<p>Joshua will automatically detect whether the input sentence is a regular sentence (the usual case) +or a lattice. If a lattice, a feature will be activated that accumulates the cost of different +paths through the lattice. In this case, you need to ensure that a weight for this feature is +present in <a href="decoder.html">your model file</a>. The <a href="pipeline.html">pipeline</a> will handle this +automatically, or if you are doing this manually, you can add the line</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>SourcePath COST +</code></pre> +</div> + +<p>to your Joshua configuration file. 
</p> + +<p>Lattices must be listed one per line.</p> + +<h3 id="alternate-modes-of-operation-a-idmodes-">Alternate modes of operation <a id="modes"></a></h3> + +<p>In addition to decoding input sentences in the standard way, Joshua supports both <em>constrained +decoding</em> and <em>synchronous parsing</em>. In both settings, both the source and target sides are provided +as input, and the decoder finds a derivation between them.</p> + +<h4 id="constrained-decoding">Constrained decoding</h4> + +<p>To enable constrained decoding, simply append the desired target string as part of the input, in +the following format:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>source sentence ||| target sentence +</code></pre> +</div> + +<p>Joshua will translate the source sentence constrained to the target sentence. There are a few +caveats:</p> + +<ul> + <li> + <p>Left-state minimization cannot be enabled for the language model</p> + </li> + <li> + <p>A heuristic is used to constrain the derivation (the LM state must match against the +input). This is not a perfect heuristic, and sometimes results in analyses that are not +perfectly constrained to the input, but have extra words.</p> + </li> +</ul> + +<h4 id="synchronous-parsing">Synchronous parsing</h4> + +<p>Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in +Dyer (NAACL 2010) (<a href="http://www.aclweb.org/anthology/N10-1033.pdf">PDF</a>). 
To enable this:</p> + +<ul> + <li> + <p>Set the configuration parameter <code class="highlighter-rouge">parse = true</code>.</p> + </li> + <li> + <p>Remove all language models from the input file </p> + </li> + <li> + <p>Provide input in the following format:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> source sentence ||| target sentence +</code></pre> + </div> + </li> +</ul> + +<p>You may also wish to display the synchronous parse tree (<code class="highlighter-rouge">-output-format %t</code>) and the alignment +(<code class="highlighter-rouge">-show-align-index</code>).</p> + + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img 
class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html>
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/decoder.md ---------------------------------------------------------------------- diff --git a/5.0/decoder.md b/5.0/decoder.md deleted file mode 100644 index b78cead..0000000 --- a/5.0/decoder.md +++ /dev/null @@ -1,374 +0,0 @@ ---- -layout: default -category: links -title: Decoder configuration parameters ---- - -Joshua configuration parameters affect the runtime behavior of the decoder itself. This page -describes the complete list of these parameters and describes how to invoke the decoder manually. - -To run the decoder, a convenience script is provided that loads the necessary Java libraries. -Assuming you have set the environment variable `$JOSHUA` to point to the root of your installation, -its syntax is: - - $JOSHUA/bin/decoder [-m memory-amount] [-c config-file other-joshua-options ...] - -The `-m` argument, if present, must come first, and the memory specification is in Java format -(e.g., 400m, 4g, 50g). Most notably, the suffixes "m" and "g" are used for "megabytes" and -"gigabytes", and there cannot be a space between the number and the unit. The value of this -argument is passed to Java itself in the invocation of the decoder, and the remaining options are -passed to Joshua. The `-c` parameter has special import because it specifies the location of the -configuration file. - -The Joshua decoder works by reading from STDIN and printing translations to STDOUT as they are -received, according to a number of [output options](#output). If no run-time parameters are -specified (e.g., no translation model), sentences are simply pushed through untranslated. Blank -lines are similarly pushed through as blank lines, so as to maintain parallelism with the input. - -Parameters can be provided to Joshua via a configuration file and from the command -line. Command-line arguments override values found in the configuration file. 
The format for -configuration file parameters is - - parameter = value - -Command-line options are specified in the following format - - -parameter value - -Values are one of four types (which we list here mostly to call attention to the boolean format): - -- STRING, an arbitrary string (no spaces) -- FLOAT, a floating-point value -- INT, an integer -- BOOLEAN, a boolean value. For booleans, `true` evaluates to true, and all other values evaluate - to false. For command-line options, the value may be omitted, in which case it evaluates to - true. For example, the following are equivalent: - - $JOSHUA/bin/decoder -mark-oovs true - $JOSHUA/bin/decoder -mark-oovs - -## Joshua configuration file - -In addition to the decoder parameters described below, the configuration file contains the model -feature weights. These weights are distinguished from runtime parameters in that they are delimited -by a space instead of an equals sign. They take the following -format, and by convention are placed at the end of the configuration file: - - lm_0 4.23 - tm_pt_0 -0.2 - OOVPenalty -100 - -Joshua can make use of thousands of features, which are described in further detail in the -[feature file](features.html). - -## Joshua decoder parameters - -This section contains a list of the Joshua run-time parameters. An important note about the -parameters is that they are collapsed to canonical form, in which dashes (-) and underscores (_) are -removed and case is converted to lowercase. For example, the following parameter forms are -equivalent (either in the configuration file or from the command line): - - {top-n, topN, top_n, TOP_N, t-o-p-N} - {poplimit, pop-limit, pop_limit, popLimit,PoPlImIt} - -This basically defines equivalence classes of parameters, and relieves you of the task of having to -remember the exact format of each parameter. 
- -In what follows, we group the configuration parameters in the following groups: - -- [General options](#general) -- [Pruning](#pruning) -- [Translation model options](#tm) -- [Language model options](#lm) -- [Output options](#output) -- [Alternate modes of operation](#modes) - -<a id="general" /> - -### General decoder options - -- `c`, `config` --- *NULL* - - Specifies the configuration file from which Joshua options are loaded. This feature is unique in - that it must be specified from the command line (obviously). - -- `amortize` --- *true* - - When true, specifies that sorting of the rule lists at each trie node in the grammar should be - delayed until the trie node is accessed. When false, all such nodes are sorted before decoding - even begins. Setting to true results in slower per-sentence decoding, but allows the decoder to - begin translating almost immediately (especially with large grammars). - -- `server-port` --- *0* - - If set to a nonzero value, Joshua will start a multithreaded TCP/IP server on the specified - port. Clients can connect to it directly through programming APIs or command-line tools like - `telnet` or `nc`. - - $ $JOSHUA/bin/decoder -m 30g -c /path/to/config/file -server-port 8723 - ... - $ cat input.txt | nc localhost 8723 > results.txt - -- `maxlen` --- *200* - - Input sentences longer than this are truncated. - -- `feature-function` - - Enables a particular feature function. See the [feature function page](features.html) for more information. - -- `oracle-file` --- *NULL* - - The location of a set of oracle reference translations, parallel to the input. When present, - after producing the hypergraph by decoding the input sentence, the oracle is used to rescore the - translation forest with a BLEU approximation in order to extract the oracle-translation from the - forest. This is useful for obtaining an (approximation to an) upper bound on your translation - model under particular search settings. 
- -- `default-nonterminal` --- *"X"* - - This is the nonterminal symbol assigned to out-of-vocabulary (OOV) items. Joshua assigns this - label to every word of the input, in fact, so that even known words can be translated as OOVs, if - the model prefers them. Usually, a very low weight on the `OOVPenalty` feature discourages their - use unless necessary. - -- `goal-symbol` --- *"GOAL"* - - This is the symbol whose presence in the chart over the whole input span denotes a successful - parse (translation). It should match the LHS nonterminal in your glue grammar. Internally, - Joshua represents nonterminals enclosed in square brackets (e.g., "[GOAL]"), which you can - optionally supply in the configuration file. - -- `true-oovs-only` --- *false* - - By default, Joshua creates an OOV entry for every word in the source sentence, regardless of - whether it is found in the grammar. This allows every word to be pushed through untranslated - (although potentially incurring a high cost based on the `OOVPenalty` feature). If this option is - set, then only true OOVs are entered into the chart as OOVs. To determine "true" OOVs, Joshua - examines the first level of the grammar trie for each word of the input (this isn't a perfect - heuristic, since a word could be present only in deeper levels of the trie). - -- `threads`, `num-parallel-decoders` --- *1* - - This determines how many simultaneous decoding threads to launch. - - Outputs are assembled in order and Joshua has to hold on to the complete target hypergraph until - it is ready to be processed for output, so too many simultaneous threads could result in lots of - memory usage if a long sentence results in many sentences being queued up. We have run Joshua - with as many as 64 threads without any problems of this kind, but it's useful to keep in the back - of your mind. - -- `weights-file` --- NULL - - Weights are appended to the end of the Joshua configuration file, by convention. 
If you prefer to - put them in a separate file, you can do so, and point to the file with this parameter. - -### Pruning options <a id="pruning" /> - -- `pop-limit` --- *100* - - The number of cube-pruning hypotheses that are popped from the candidates list for each span of - the input. Higher values result in a larger portion of the search space being explored at the - cost of an increased search time. For exhaustive search, set `pop-limit` to 0. - -- `filter-grammar` --- false - - Set to true, this enables dynamic sentence-level filtering. For each sentence, each grammar is - filtered at runtime down to rules that can be applied to the sentence under consideration. This - takes some time (which we haven't thoroughly quantified), but can result in the removal of many - rules that are only partially applicable to the sentence. - -- `constrain-parse` --- *false* -- `use_pos_labels` --- *false* - - *These features are not documented.* - -### Translation model options <a id="tm" /> - -Joshua supports any number of translation models. Conventionally, two are supplied: the main grammar -containing translation rules, and the glue grammar for patching things together. Internally, Joshua -doesn't distinguish between the roles of these grammars; they are treated differently only in that -they typically have different span limits (the maximum input width they can be applied to). - -Grammars are instantiated with config file lines of the following form: - - tm = TYPE OWNER SPAN_LIMIT FILE - -* `TYPE` is the grammar type, which must be set to "thrax". -* `OWNER` is the grammar's owner, which defines the set of [feature weights](features.html) that - apply to the weights found in each line of the grammar (using different owners allows each grammar - to have different sets and numbers of weights, while sharing owners allows weights to be shared - across grammars). -* `SPAN_LIMIT` is the maximum span of the input that rules from this grammar can be applied to. 
A - span limit of 0 means "no limit", while a span limit of -1 means that rules from this grammar must - be anchored to the left side of the sentence (index 0). -* `FILE` is the path to the file containing the grammar. If the file is a directory, it is assumed - to be [packed](packed.html). Only one packed grammar can currently be used at a time. - -For reference, the following two translation model lines are used by the [pipeline](pipeline.html): - - tm = thrax pt 20 /path/to/packed/grammar - tm = thrax glue -1 /path/to/glue/grammar - -### Language model options <a id="lm" /> - -Joshua supports any number of language models. To add a language -model, add a line of the following format to the configuration file: - - lm = TYPE ORDER LEFT_STATE RIGHT_STATE CEILING_COST FILE - -where the six fields correspond to the following values: - -* `TYPE`: one of "kenlm", "berkeleylm", or "none" -* `ORDER`: the order of the language model -* `LEFT_STATE`: whether to use left-state minimization; currently only supported by KenLM -* `RIGHT_STATE`: whether to use right equivalent state (currently unsupported) -* `CEILING_COST`: the LM-specific ceiling cost of all n-grams (currently ignored) -* `FILE`: the path to the language model file. All language model types support the standard ARPA - format. Additionally, if the LM type is "kenlm", this file can be compiled into KenLM's compiled - format (using the program at `$JOSHUA/bin/build_binary`); if the LM type is "berkeleylm", it - can be compiled by following the directions in - `$JOSHUA/src/joshua/decoder/ff/lm/berkeley_lm/README`. The [pipeline](pipeline.html) will - automatically compile either type. - -For each language model, you need to specify a feature weight in the following format: - - lm_0 WEIGHT - lm_1 WEIGHT - ... - -where the indices correspond to the order of the language model declaration lines.
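To make the correspondence between `lm` declaration lines and `lm_N` weight names concrete, here is a minimal Python sketch that reads the six fields described above and numbers the models in declaration order. It is an illustration under stated assumptions rather than Joshua's own parser: the `LanguageModel` class, the `parse_lm_lines` function, and the sample `lm` lines (including their paths and field values) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LanguageModel:
    weight_name: str    # feature weight name, e.g. "lm_0"
    lm_type: str        # "kenlm", "berkeleylm", or "none"
    order: int          # n-gram order
    left_state: bool    # left-state minimization
    right_state: bool   # right equivalent state
    ceiling_cost: float
    path: str


def parse_lm_lines(config_text: str) -> List[LanguageModel]:
    """Collect `lm = ...` lines and number them in declaration order."""
    models: List[LanguageModel] = []
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        # Weight lines such as "lm_0 4.23" contain no '=' and are skipped.
        if not sep or key.strip() != "lm":
            continue
        fields = value.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields in lm line: {line!r}")
        lm_type, order, left, right, ceiling, path = fields
        models.append(LanguageModel(
            weight_name=f"lm_{len(models)}",
            lm_type=lm_type,
            order=int(order),
            left_state=(left == "true"),
            right_state=(right == "true"),
            ceiling_cost=float(ceiling),
            path=path,
        ))
    return models


# Hypothetical configuration fragment, for illustration only.
SAMPLE_CONFIG = """\
tm = thrax pt 20 /path/to/packed/grammar
lm = kenlm 5 true false 100 /path/to/first.kenlm
lm = berkeleylm 3 false false 100 /path/to/second.berkeleylm
lm_0 4.23
lm_1 1.0
"""

sample_models = parse_lm_lines(SAMPLE_CONFIG)
```

In this sketch the first declared model receives the weight name `lm_0` and the second `lm_1`, which is the ordering rule the text describes.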
- -### Output options <a id="output" /> - -- `output-format` *New in 5.0* - - Joshua prints a lot of information to STDERR (making this more granular is on the TODO - list). Output to STDOUT is reserved for decoder translations, and is controlled by the - `output-format` parameter, which can contain the following items: - - - `%i`: the sentence number (0-indexed) - - - `%e`: the source sentence - - - `%s`: the translated sentence - - - `%S`: the translated sentence, with some basic capitalization and denormalization, e.g., - - $ echo "¿ who you lookin' at , mr. ?" | $JOSHUA/bin/decoder -output-format "%S" -mark-oovs false 2> /dev/null - ¿Who you lookin' at, Mr.? - - - `%t`: the synchronous derivation - - - `%f`: the list of feature values (as name=value pairs) - - - `%c`: the model cost - - - `%w`: the weight vector (unimplemented) - - - `%a`: the alignments between source and target words (unimplemented) - - The default value is - - output-format = %i ||| %s ||| %f ||| %c - - i.e., - - input ID ||| translation ||| model scores ||| score - -- `top-n` --- *300* - - The number of translation hypotheses to output, sorted in decreasing order of model score. - -- `use-unique-nbest` --- *true* - - When constructing the n-best list for a sentence, skip hypotheses whose string has already been - output. - -- `escape-trees` --- *false* - -- `include-align-index` --- *false* - - Output the source word indices that each target word aligns to. - -- `mark-oovs` --- *false* - - If `true`, this causes the text "_OOV" to be appended to each untranslated word in the output. - -- `visualize-hypergraph` --- *false* - - If set to true, a visualization of the hypergraph will be displayed, though you will have to - explicitly include the relevant jar files. See the example usage in - `$JOSHUA/examples/tree_visualizer/`, which contains a demonstration of a source sentence, - translation, and synchronous derivation. - -- `dump-hypergraph` --- "" - - This feature directs that the hypergraph should be written to disk for each input sentence.
If - set, the value should contain the string "%d", which is replaced with the sentence number. For - example, - - cat input.txt | $JOSHUA/bin/decoder -dump-hypergraph hgs/%d.txt - - Note that the output directory must exist. - - TODO: revive the - [discussion on a common hypergraph format](http://aclweb.org/aclwiki/index.php?title=Hypergraph_Format) - on the ACL Wiki and support that format. - -### Lattice decoding - -In addition to regular sentences, Joshua can decode weighted lattices encoded in -[the PLF format](http://www.statmt.org/moses/?n=Moses.WordLattices), except that path costs should -be listed as <b>log probabilities</b> instead of probabilities. Lattice decoding was originally -added by Lane Schwartz and [Chris Dyer](http://www.cs.cmu.edu/~cdyer/). - -Joshua will automatically detect whether the input sentence is a regular sentence (the usual case) -or a lattice. If a lattice, a feature will be activated that accumulates the cost of different -paths through the lattice. In this case, you need to ensure that a weight for this feature is -present in [your model file](decoder.html). The [pipeline](pipeline.html) will handle this -automatically, or if you are doing this manually, you can add the line - - SourcePath COST - -to your Joshua configuration file. - -Lattices must be listed one per line. - -### Alternate modes of operation <a id="modes" /> - -In addition to decoding input sentences in the standard way, Joshua supports both *constrained -decoding* and *synchronous parsing*. In both settings, both the source and target sides are provided -as input, and the decoder finds a derivation between them. - -#### Constrained decoding - -To enable constrained decoding, simply append the desired target string as part of the input, in -the following format: - - source sentence ||| target sentence - -Joshua will translate the source sentence constrained to the target sentence. 
There are a few -caveats: - - * Left-state minimization cannot be enabled for the language model - - * A heuristic is used to constrain the derivation (the LM state must match against the - input). This is not a perfect heuristic, and sometimes results in analyses that are not - perfectly constrained to the input, but have extra words. - -#### Synchronous parsing - -Joshua supports synchronous parsing as a two-step sequence of monolingual parses, as described in -Dyer (NAACL 2010) ([PDF](http://www.aclweb.org/anthology/N10-1033.pdf)). To enable this: - - - Set the configuration parameter `parse = true`. - - - Remove all language models from the input file. - - - Provide input in the following format: - - source sentence ||| target sentence - -You may also wish to display the synchronous parse tree (`-output-format %t`) and the alignment -(`-show-align-index`). - http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/faq.html ---------------------------------------------------------------------- diff --git a/5.0/faq.html b/5.0/faq.html new file mode 100644 index 0000000..b6f2681 --- /dev/null +++ b/5.0/faq.html @@ -0,0 +1,170 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Common problems</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /*
font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + 
<div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Common problems</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>Solutions to common problems will be posted here as we become aware of them.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + 
alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/faq.md ---------------------------------------------------------------------- diff --git a/5.0/faq.md b/5.0/faq.md deleted file mode 100644 index 2ac67ba..0000000 --- a/5.0/faq.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: default -category: help -title: Common problems ---- - -Solutions to common problems will be posted here as we become aware of them. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/features.html ---------------------------------------------------------------------- diff --git a/5.0/features.html b/5.0/features.html new file mode 100644 index 0000000..32cd205 --- /dev/null +++ b/5.0/features.html @@ -0,0 +1,170 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Features</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + 
</style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Features</h2> + <span id="download"> + <a 
href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>Joshua 5.0 uses a sparse feature representation to encode features internally.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/features.md 
---------------------------------------------------------------------- diff --git a/5.0/features.md b/5.0/features.md deleted file mode 100644 index 7613954..0000000 --- a/5.0/features.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -layout: default -title: Features ---- - -Joshua 5.0 uses a sparse feature representation to encode features internally. http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/file-formats.html ---------------------------------------------------------------------- diff --git a/5.0/file-formats.html b/5.0/file-formats.html new file mode 100644 index 0000000..ab7de65 --- /dev/null +++ b/5.0/file-formats.html @@ -0,0 +1,248 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Joshua file formats</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script 
src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Joshua file formats</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This page describes 
the formats of Joshua configuration and support files.</p> + +<h2 id="translation-models-grammars">Translation models (grammars)</h2> + +<p>Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by +<a href="">cdec</a>, and supported by <a href="">hierarchical Moses</a>), and an efficient +<a href="packing.html">packed representation</a> developed by <a href="http://cs.jhu.edu/~juri">Juri Ganitkevich</a>.</p> + +<p>Grammar rules follow this format.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>[LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES +</code></pre> +</div> + +<p>The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are +linked across sides by indices. There is no limit to the number of paired nonterminals in the rule +or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars).</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>[X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17 +[S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17 +[VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956 +</code></pre> +</div> + +<p>The feature values can have optional labels, e.g.:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>[X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17 +</code></pre> +</div> + +<p>One file common to decoding is the glue grammar, which for hiero grammar is defined as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>[GOAL] ||| <s> ||| <s> ||| 0 +[GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1 +[GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0 +</code></pre> +</div> + +<p>Joshua's <a href="pipeline.html">pipeline</a> supports extraction of Hiero and SAMT grammars via +<a href="thrax.html">Thrax</a> or GHKM grammars using <a href="http://www-nlp.stanford.edu/~mgalley/">Michel
Galley</a>'s +GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).</p> + +<h2 id="language-model">Language Model</h2> + +<p>Joshua has two language model implementations: <a href="http://kheafield.com/code/kenlm/">KenLM</a> and +<a href="http://berkeleylm.googlecode.com">BerkeleyLM</a>. All language model implementations support the +standard ARPA format output by <a href="http://www.speech.sri.com/projects/srilm/">SRILM</a>. In addition, +KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM +is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is +the default because of its support for left-state minimization.</p> + +<h3 id="compiling-for-kenlm">Compiling for KenLM</h3> + +<p>To compile an ARPA grammar for KenLM, use the (provided) <code class="highlighter-rouge">build-binary</code> command, located deep within +the Joshua source code:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/build_binary lm.arpa lm.kenlm +</code></pre> +</div> + +<p>This script takes the <code class="highlighter-rouge">lm.arpa</code> file and produces the compiled version in <code class="highlighter-rouge">lm.kenlm</code>.</p> + +<h3 id="compiling-for-berkeleylm">Compiling for BerkeleyLM</h3> + +<p>To compile a grammar for BerkeleyLM, type:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm +</code></pre> +</div> + +<p>The <code class="highlighter-rouge">lm.berkeleylm</code> file can then be listed directly in the <a href="decoder.html">Joshua configuration file</a>.</p> + +<h2 id="joshua-configuration-file">Joshua configuration file</h2> + +<p>The <a href="decoder.html">decoder page</a> documents decoder command-line and config file options.</p> + +<h2 id="thrax-configuration">Thrax configuration</h2> + +<p>See <a
href="thrax.html">the thrax page</a> for more information about the Thrax configuration file.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/file-formats.md ---------------------------------------------------------------------- diff --git a/5.0/file-formats.md b/5.0/file-formats.md deleted file mode 100644 index a53d661..0000000 --- a/5.0/file-formats.md +++ 
/dev/null @@ -1,72 +0,0 @@ ---- -layout: default -category: advanced -title: Joshua file formats ---- -This page describes the formats of Joshua configuration and support files. - -## Translation models (grammars) - -Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by -[cdec](), and supported by [hierarchical Moses]()), and an efficient -[packed representation](packing.html) developed by [Juri Ganitkevich](http://cs.jhu.edu/~juri). - -Grammar rules follow this format. - - [LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES - -The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are -linked across sides by indices. There is no limit to the number of paired nonterminals in the rule -or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars). - - [X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17 - [S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17 - [VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956 - -The feature values can have optional labels, e.g.: - - [X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17 - -One file common to decoding is the glue grammar, which for hiero grammar is defined as follows: - - [GOAL] ||| <s> ||| <s> ||| 0 - [GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1 - [GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0 - -Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT grammars via -[Thrax](thrax.html) or GHKM grammars using [Michel Galley](http://www-nlp.stanford.edu/~mgalley/)'s -GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed). - -## Language Model - -Joshua has two language model implementations: [KenLM](http://kheafield.com/code/kenlm/) and -[BerkeleyLM](http://berkeleylm.googlecode.com). 
All language model implementations support the -standard ARPA format output by [SRILM](http://www.speech.sri.com/projects/srilm/). In addition, -KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM -is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is -the default because of its support for left-state minimization. - -### Compiling for KenLM - -To compile an ARPA language model for KenLM, use the (provided) `build_binary` command, located deep within -the Joshua source code: - - $JOSHUA/bin/build_binary lm.arpa lm.kenlm - -This command takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`. - -### Compiling for BerkeleyLM - -To compile an ARPA language model for BerkeleyLM, type: - - java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm - -The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html). - -## Joshua configuration file - -The [decoder page](decoder.html) documents decoder command-line and config file options. - -## Thrax configuration - -See [the thrax page](thrax.html) for more information about the Thrax configuration file. 
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/index.html ---------------------------------------------------------------------- diff --git a/5.0/index.html b/5.0/index.html new file mode 100644 index 0000000..8ea323a --- /dev/null +++ b/5.0/index.html @@ -0,0 +1,255 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Getting Started</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" 
href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Getting Started</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This page contains end-user oriented documentation for the 5.0 release of +<a href="http://joshua-decoder.org/">the Joshua decoder</a>.</p> + +<h2 id="download-and-setup">Download and Setup</h2> + +<ol> + <li> + <p>Download Joshua by clicking the big green button above, or from the command line:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>wget -q 
http://cs.jhu.edu/~post/files/joshua-v5.0.tgz +</code></pre> + </div> + </li> + <li> + <p>Next, unpack it, set environment variables, and compile everything:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code>tar xzf joshua-v5.0.tgz +cd joshua-v5.0 + +# for bash +export JAVA_HOME=/path/to/java +export JOSHUA=$(pwd) +echo "export JOSHUA=$JOSHUA" >> ~/.bashrc + +# for tcsh +setenv JAVA_HOME /path/to/java +setenv JOSHUA `pwd` +echo "setenv JOSHUA $JOSHUA" >> ~/.profile + +ant +</code></pre> + </div> + + <p>(If you don’t know what to set <code class="highlighter-rouge">$JAVA_HOME</code> to, try <code class="highlighter-rouge">/usr/java/default</code>)</p> + </li> + <li> + <p>If you have a Hadoop installation, make sure that the environment variable <code class="highlighter-rouge">$HADOOP</code> is set and +points to it. If you don’t, Joshua will roll one out for you in standalone mode.</p> + </li> + <li> + <p>If you want to use Cherry & Foster’s +<a href="http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf">batch MIRA tuner</a> (recommended), you need to +<a href="http://www.statmt.org/moses/?n=Development.GetStarted">install Moses</a> and define the <code class="highlighter-rouge">$MOSES</code> +environment variable to point to the root of the Moses installation.</p> + </li> +</ol> + +<h2 id="quick-start">Quick start</h2> + +<p>Our <a href="pipeline.html">pipeline script</a> is the quickest way to get started. 
For example, to +train and test a complete model translating from Bengali to English:</p> + +<p>First, download the Indian languages data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>wget --no-check -O indian-languages.tgz https://github.com/joshua-decoder/indian-parallel-corpora/tarball/master +tar xf indian-languages.tgz +ln -s joshua-decoder-indian-parallel-corpora-b71d31a input +</code></pre> +</div> + +<p>Then, train and test a model:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl --source bn --target en \ + --no-prepare --aligner berkeley \ + --corpus input/bn-en/tok/training.bn-en \ + --tune input/bn-en/tok/dev.bn-en \ + --test input/bn-en/tok/devtest.bn-en +</code></pre> +</div> + +<p>This will align the data with the Berkeley aligner, build a Hiero model, tune with MERT, decode the +test sets, and report results that should correspond with what you find on <a href="/indian-parallel-corpora/">the Indian Parallel Corpora page</a>. For +more details, including information on the many options available with the pipeline script, please +see <a href="pipeline.html">its documentation page</a>.</p> + +<h2 id="more-information">More information</h2> + +<p>For more detail on the decoder itself, including its command-line options, see +<a href="decoder.html">the Joshua decoder page</a>. 
You can also learn more about other steps of +<a href="pipeline.html">the Joshua MT pipeline</a>, including <a href="thrax.html">grammar extraction</a> with Thrax and +Joshua’s <a href="packing.html">efficient grammar representation</a>.</p> + +<p>If you have problems or issues, you might find some help <a href="faq.html">on our answers page</a> or +<a href="https://groups.google.com/forum/?fromgroups#!forum/joshua_support">in the mailing list archives</a>.</p> + +<p>A <a href="bundle.html">bundled configuration</a>, which is a minimal set of configuration, resource, and script files, can be created and easily transferred and shared.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + 
href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/index.md ---------------------------------------------------------------------- diff --git a/5.0/index.md b/5.0/index.md deleted file mode 100644 index 7a1d016..0000000 --- a/5.0/index.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -layout: default -title: Getting Started ---- - -This page contains end-user oriented documentation for the 5.0 release of -[the Joshua decoder](http://joshua-decoder.org/). - -## Download and Setup - -1. Download Joshua by clicking the big green button above, or from the command line: - - wget -q http://cs.jhu.edu/~post/files/joshua-v5.0.tgz - -1. Next, unpack it, set environment variables, and compile everything: - - tar xzf joshua-v5.0.tgz - cd joshua-v5.0 - - # for bash - export JAVA_HOME=/path/to/java - export JOSHUA=$(pwd) - echo "export JOSHUA=$JOSHUA" >> ~/.bashrc - - # for tcsh - setenv JAVA_HOME /path/to/java - setenv JOSHUA `pwd` - echo "setenv JOSHUA $JOSHUA" >> ~/.profile - - ant - - (If you don't know what to set `$JAVA_HOME` to, try `/usr/java/default`) - -3. If you have a Hadoop installation, make sure that the environment variable `$HADOOP` is set and -points to it. If you don't, Joshua will roll one out for you in standalone mode. - -4. If you want to use Cherry & Foster's -[batch MIRA tuner](http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf) (recommended), you need to -[install Moses](http://www.statmt.org/moses/?n=Development.GetStarted) and define the `$MOSES` -environment variable to point to the root of the Moses installation. - -## Quick start - -Our <a href="pipeline.html">pipeline script</a> is the quickest way to get started. 
For example, to -train and test a complete model translating from Bengali to English: - -First, download the Indian languages data: - - wget --no-check -O indian-languages.tgz https://github.com/joshua-decoder/indian-parallel-corpora/tarball/master - tar xf indian-languages.tgz - ln -s joshua-decoder-indian-parallel-corpora-b71d31a input - -Then, train and test a model: - - $JOSHUA/bin/pipeline.pl --source bn --target en \ - --no-prepare --aligner berkeley \ - --corpus input/bn-en/tok/training.bn-en \ - --tune input/bn-en/tok/dev.bn-en \ - --test input/bn-en/tok/devtest.bn-en - -This will align the data with the Berkeley aligner, build a Hiero model, tune with MERT, decode the -test sets, and report results that should correspond with what you find on <a -href="/indian-parallel-corpora/">the Indian Parallel Corpora page</a>. For -more details, including information on the many options available with the pipeline script, please -see <a href="pipeline.html">its documentation page</a>. - -## More information - -For more detail on the decoder itself, including its command-line options, see -[the Joshua decoder page](decoder.html). You can also learn more about other steps of -[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and -Joshua's [efficient grammar representation](packing.html). - -If you have problems or issues, you might find some help [on our answers page](faq.html) or -[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support). - -A [bundled configuration](bundle.html), which is a minimal set of configuration, resource, and script files, can be created and easily transferred and shared.
