http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/server.md ---------------------------------------------------------------------- diff --git a/5.0/server.md b/5.0/server.md deleted file mode 100644 index 52b2a66..0000000 --- a/5.0/server.md +++ /dev/null @@ -1,30 +0,0 @@ ---- -layout: default -category: links -title: Server mode ---- - -The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs. - -Threading takes place both within and across requests. Threads from the decoder pool are assigned in a round-robin manner across requests, preventing starvation. - - -# Invoking the server - -A running server is configured at invocation time. To start in server mode, run `joshua-decoder` with the option `-server-port [PORT]`. Additionally, the server accepts the same configuration options as the command-line tool. - -E.g., - - $JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10 - -## Using the server - -To test that the server is working, a set of inputs can be sent to the server from the command line. - -The server, as configured in the example above, will then respond to requests on port 10101. You can test it out with the `nc` utility: - - wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101 - -Since no model was loaded, the server will simply echo back the text it was sent. - -The `-server-port` option can also be used when creating a [bundled configuration](bundle.html) that will be run in server mode.
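The newline-separated request/response protocol described above can also be exercised programmatically. The sketch below is illustrative, not part of Joshua: the `translate` helper name, the host/port defaults, and the assumption that the server replies line-for-line and closes the connection once the client shuts down its write side (mirroring the `nc` example) are all assumptions.

```python
import socket

def translate(sentences, host="localhost", port=10101):
    """Send newline-separated inputs to a Joshua server socket and return
    the newline-separated outputs as a list of strings.

    Hypothetical helper: Joshua only specifies the wire protocol
    (newline-separated in, newline-separated out), not this function.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(("\n".join(sentences) + "\n").encode("utf-8"))
        # Close the write side so the server knows the input is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: output is complete
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8").splitlines()
```

With no model loaded, the server behaves like the `nc` example above and simply echoes each input line back.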
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/thrax.html ---------------------------------------------------------------------- diff --git a/5.0/thrax.html b/5.0/thrax.html new file mode 100644 index 0000000..cbbfdee --- /dev/null +++ b/5.0/thrax.html @@ -0,0 +1,177 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Grammar extraction with Thrax</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" 
href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Grammar extraction with Thrax</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar +filtering, and details on the configuration file options. 
It will also include details about our +experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought +sweat and tears.</p> + +<p>In the meantime, please bother <a href="http://cs.jhu.edu/~jonny/">Jonny Weese</a> if there is something you +need to do that you don’t understand. You might also be able to dig up some information <a href="http://cs.jhu.edu/~jonny/thrax/">on the old +Thrax page</a>.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for 
Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/thrax.md ---------------------------------------------------------------------- diff --git a/5.0/thrax.md b/5.0/thrax.md deleted file mode 100644 index a904b23..0000000 --- a/5.0/thrax.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -layout: default -category: advanced -title: Grammar extraction with Thrax ---- - -One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar -filtering, and details on the configuration file options. It will also include details about our -experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought -sweat and tears. - -In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if there is something you -need to do that you don't understand. You might also be able to dig up some information [on the old -Thrax page](http://cs.jhu.edu/~jonny/thrax/). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tms.html ---------------------------------------------------------------------- diff --git a/5.0/tms.html b/5.0/tms.html new file mode 100644 index 0000000..2138073 --- /dev/null +++ b/5.0/tms.html @@ -0,0 +1,290 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Building Translation Models</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + 
color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a 
href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Building Translation Models</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <h1 id="build-a-translation-model">Build a translation model</h1> + +<p>Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences.</p> + +<p>We will copy (or symlink) the parallel source text files in a subdirectory called <code class="highlighter-rouge">input/</code>.</p> + +<p>Then, we concatenate all the training files on each side. 
The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the <code class="highlighter-rouge">pipeline.pl</code> option <code class="highlighter-rouge">--first-step alignment</code>.</p> + +<p>To tokenize the English data, do</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cat callhome.en europarl.en fisher.en | tee all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en +</code></pre> +</div> + +<p>The same can be done for the Spanish side of the input data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cat callhome.es europarl.es fisher.es | tee all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es +</code></pre> +</div> + +<p>By the way, an alternative tokenizer is a Twitter tokenizer found in the <a href="http://github.com/vandurme/jerboa">Jerboa</a> project.</p> + +<p>The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ + | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en +</code></pre> +</div> + +<p>Contents of <code class="highlighter-rouge">splittabs.pl</code> by Matt Post:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="c1">#!/usr/bin/perl</span> + +<span class="c1"># splits on tab, printing respective chunks to the list of files given</span> +<span class="c1"># as script arguments</span> + +<span class="k">use</span> <span 
class="nv">FileHandle</span><span class="p">;</span> + +<span class="k">my</span> <span class="nv">@fh</span><span class="p">;</span> +<span class="vg">$|</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1"># don't buffer output</span> + +<span class="k">if</span> <span class="p">(</span><span class="nv">@ARGV</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> + <span class="k">print</span> <span class="s">"Usage: splittabs.pl &lt; tabbed-file\n"</span><span class="p">;</span> + <span class="nb">exit</span><span class="p">;</span> +<span class="p">}</span> + +<span class="k">my</span> <span class="nv">@fh</span> <span class="o">=</span> <span class="nb">map</span> <span class="p">{</span> <span class="nv">get_filehandle</span><span class="p">(</span><span class="nv">$_</span><span class="p">)</span> <span class="p">}</span> <span class="nv">@ARGV</span><span class="p">;</span> +<span class="nv">@ARGV</span> <span class="o">=</span> <span class="p">();</span> + +<span class="k">while</span> <span class="p">(</span><span class="k">my</span> <span class="nv">$line</span> <span class="o">=</span> <span class="o">&lt;&gt;</span><span class="p">)</span> <span class="p">{</span> + <span class="nb">chomp</span><span class="p">(</span><span class="nv">$line</span><span class="p">);</span> + <span class="k">my</span> <span class="p">(</span><span class="nv">@fields</span><span class="p">)</span> <span class="o">=</span> <span class="nb">split</span><span class="p">(</span><span class="sr">/\t/</span><span class="p">,</span><span class="nv">$line</span><span class="p">,</span><span class="nb">scalar</span> <span class="nv">@fh</span><span class="p">);</span> + + <span class="nb">map</span> <span class="p">{</span> <span class="k">print</span> <span class="p">{</span><span class="nv">$fh</span><span class="p">[</span><span class="nv">$_</span><span class="p">]}</span> <span 
class="s">"$fields[$_]\n"</span> <span class="p">}</span> <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="nv">$#fields</span><span class="p">);</span> +<span class="p">}</span> + +<span class="k">sub </span><span class="nf">get_filehandle</span> <span class="p">{</span> + <span class="k">my</span> <span class="nv">$file</span> <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span> + + <span class="k">if</span> <span class="p">(</span><span class="nv">$file</span> <span class="ow">eq</span> <span class="s">"-"</span><span class="p">)</span> <span class="p">{</span> + <span class="k">return</span> <span class="o">*</span><span class="bp">STDOUT</span><span class="p">;</span> + <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> + <span class="nb">local</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span> + <span class="nb">open</span> <span class="nv">FH</span><span class="p">,</span> <span class="s">">$file"</span> <span class="ow">or</span> <span class="nb">die</span> <span class="s">"can't open '$file' for writing"</span><span class="p">;</span> + <span class="k">return</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span> + <span class="p">}</span> +<span class="p">}</span> +</code></pre> +</div> + +<p>Now we can run the pipeline to extract the grammar. Run the following script:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span> + +<span class="c"># this creates a grammar</span> + +<span class="c"># NEED:</span> +<span class="c"># pair</span> +<span class="c"># type</span> + +<span class="nb">set</span> -u + +<span class="nv">pair</span><span class="o">=</span>es-en +<span class="nb">type</span><span class="o">=</span>hiero + +<span class="c">#. 
~/.bashrc</span> + +<span class="c">#basedir=$(pwd)</span> + +<span class="nv">dir</span><span class="o">=</span>grammar-<span class="nv">$pair</span>-<span class="nv">$type</span> + +<span class="o">[[</span> ! -d <span class="nv">$dir</span> <span class="o">]]</span> <span class="o">&&</span> mkdir -p <span class="nv">$dir</span> +<span class="nb">cd</span> <span class="nv">$dir</span> + +<span class="nb">source</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 1<span class="k">)</span> +<span class="nv">target</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 2<span class="k">)</span> + +<span class="nv">$JOSHUA</span>/scripts/training/pipeline.pl <span class="se">\</span> + --source <span class="nv">$source</span> <span class="se">\</span> + --target <span class="nv">$target</span> <span class="se">\</span> + --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks <span class="se">\</span> + --type <span class="nv">$type</span> <span class="se">\</span> + --joshua-mem 100g <span class="se">\</span> + --no-prepare <span class="se">\</span> + --first-step align <span class="se">\</span> + --last-step thrax <span class="se">\</span> + --hadoop <span class="nv">$HADOOP</span> <span class="se">\</span> + --threads 8 <span class="se">\</span> +</code></pre> +</div> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script 
src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tms.md ---------------------------------------------------------------------- diff --git a/5.0/tms.md b/5.0/tms.md deleted file mode 100644 index 68f8732..0000000 --- a/5.0/tms.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -layout: default -category: advanced -title: Building Translation Models ---- - -# Build a translation model - -Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences. - -We will copy (or symlink) the parallel source text files in a subdirectory called `input/`. - -Then, we concatenate all the training files on each side. 
The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step alignment`. - -* to tokenize the English data, do - - cat callhome.en europarl.en fisher.en | tee all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en - -The same can be done for the Spanish side of the input data: - - cat callhome.es europarl.es fisher.es | tee all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es - -By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project. - -The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line. - - paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ - | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en - -contents of `splittabs.pl` by Matt Post: - - #!/usr/bin/perl - - # splits on tab, printing respective chunks to the list of files given - # as script arguments - - use FileHandle; - - my @fh; - $| = 1; # don't buffer output - - if (@ARGV == 0) { - print "Usage: splittabs.pl < tabbed-file\n"; - exit; - } - - my @fh = map { get_filehandle($_) } @ARGV; - @ARGV = (); - - while (my $line = <>) { - chomp($line); - my (@fields) = split(/\t/,$line,scalar @fh); - - map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields); - } - - sub get_filehandle { - my $file = shift; - - if ($file eq "-") { - return *STDOUT; - } else { - local *FH; - open FH, ">$file" or die "can't open '$file' for writing"; - return *FH; - } - } - -Now we can run the pipeline to extract the grammar. 
Run the following script: - - #!/bin/bash - - # this creates a grammar - - # NEED: - # pair - # type - - set -u - - pair=es-en - type=hiero - - #. ~/.bashrc - - #basedir=$(pwd) - - dir=grammar-$pair-$type - - [[ ! -d $dir ]] && mkdir -p $dir - cd $dir - - source=$(echo $pair | cut -d- -f 1) - target=$(echo $pair | cut -d- -f 2) - - $JOSHUA/scripts/training/pipeline.pl \ - --source $source \ - --target $target \ - --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \ - --type $type \ - --joshua-mem 100g \ - --no-prepare \ - --first-step align \ - --last-step thrax \ - --hadoop $HADOOP \ - --threads 8 \ http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tutorial.html ---------------------------------------------------------------------- diff --git a/5.0/tutorial.html b/5.0/tutorial.html new file mode 100644 index 0000000..59c0c40 --- /dev/null +++ b/5.0/tutorial.html @@ -0,0 +1,368 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Pipeline tutorial</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; 
+ padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Pipeline tutorial</h2> 
+ <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This document will walk you through using the pipeline in a variety of scenarios. Once you’ve gained a +sense for how the pipeline works, you can consult the <a href="pipeline.html">pipeline page</a> for a number of +other options available in the pipeline.</p> + +<h2 id="download-and-setup">Download and Setup</h2> + +<p>Download and install Joshua as described on the <a href="index.html">quick start page</a>, installing it under +<code class="highlighter-rouge">~/code/</code>. Once you’ve done that, you should make sure you have the following environment variables set:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>export JOSHUA=$HOME/code/joshua-v5.0 +export JAVA_HOME=/usr/java/default +</code></pre> +</div> + +<p>If you have a Hadoop installation, make sure you’ve set <code class="highlighter-rouge">$HADOOP</code> to point to it (if not, Joshua +will roll out a standalone cluster for you). If you’d like to use kbmira for tuning, you should also +install Moses, and define the environment variable <code class="highlighter-rouge">$MOSES</code> to point to the root of its installation.</p> + +<h2 id="a-basic-pipeline-run">A basic pipeline run</h2> + +<p>For today’s experiments, we’ll be building a Bengali–English system using data included in the +<a href="/indian-parallel-corpora/">Indian Languages Parallel Corpora</a>. This data was collected by taking +the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon’s +<a href="http://www.mturk.com/">Mechanical Turk</a>. 
As a warning, many of these pages contain material that is +not typically found in machine translation tutorials.</p> + +<p>Download the data and install it somewhere:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data +wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip +unzip indian-parallel-corpora.zip +</code></pre> +</div> + +<p>Then define the environment variable <code class="highlighter-rouge">$INDIAN</code> to point to it:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data/indian-parallel-corpora-master +export INDIAN=$(pwd) +</code></pre> +</div> + +<h3 id="preparing-the-data">Preparing the data</h3> + +<p>Inside this archive is a directory for each language pair. Within each language directory is another +directory named <code class="highlighter-rouge">tok/</code>, which contains pre-tokenized and normalized versions of the data. This was +done because the normalization scripts provided with Joshua are written in scripting languages that +often have problems properly handling UTF-8 character sets. 
We will be using these tokenized +versions, and will prevent the pipeline from retokenizing them by using the <code class="highlighter-rouge">--no-prepare</code> flag.</p> + +<p>In <code class="highlighter-rouge">$INDIAN/bn-en/tok</code>, you should see the following files:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$ ls $INDIAN/bn-en/tok +dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2 +dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3 +dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn +dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en +dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1 +</code></pre> +</div> + +<p>We will now use this data to test the complete pipeline with a single command.</p> + +<h3 id="run-the-pipeline">Run the pipeline</h3> + +<p>Create an experiments directory to contain your first experiment:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/expts/joshua +cd ~/expts/joshua +</code></pre> +</div> + +<p>We will now create the baseline run, using a particular directory structure for experiments that +will allow us to take advantage of scripts provided with Joshua for displaying the results of many +related experiments.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/expts/joshua +$JOSHUA/bin/pipeline.pl \ + --rundir 1 \ + --readme "Baseline Hiero run" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --corpus $INDIAN/bn-en/tok/dict.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --lm-order 3 +</code></pre> +</div> + +<p>This will start the pipeline building a Bengali–English translation system constructed from the +training data and a dictionary, tuned against dev, and tested against devtest. 
It will use the +default values for most of the pipeline: <a href="https://code.google.com/p/giza-pp/">GIZA++</a> for alignment, +KenLM's <code class="highlighter-rouge">lmplz</code> for building the language model, Z-MERT for tuning, KenLM with left-state +minimization for representing LM state in the decoder, and so on. We change the order of the n-gram +model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.</p> + +<p>A few notes:</p> + +<ul> + <li> + <p>This will likely take many hours to run, especially if you don't have a Hadoop cluster.</p> + </li> + <li> + <p>If you are running on Mac OS X, KenLM's <code class="highlighter-rouge">lmplz</code> will not build due to the absence of static +libraries. In that case, you should add the flag <code class="highlighter-rouge">--lm-gen srilm</code> (recommended, if SRILM is +installed) or <code class="highlighter-rouge">--lm-gen berkeleylm</code>.</p> + </li> +</ul> + +<h3 id="variations">Variations</h3> + +<p>Once that is finished, you will have a baseline model. From there, you might wish to try variations +of the baseline model. Here are some examples of what you could vary:</p> + +<ul> + <li> + <p>Build an SAMT model (<code class="highlighter-rouge">--type samt</code>), GHKM model (<code class="highlighter-rouge">--type ghkm</code>), or phrasal ITG model (<code class="highlighter-rouge">--type phrasal</code>)</p> + </li> + <li> + <p>Use the Berkeley aligner instead of GIZA++ (<code class="highlighter-rouge">--aligner berkeley</code>)</p> + </li> + <li> + <p>Build the language model with BerkeleyLM (<code class="highlighter-rouge">--lm-gen berkeleylm</code>) instead of KenLM (the default)</p> + </li> + <li> + <p>Change the order of the LM from the default of 5 (e.g., <code class="highlighter-rouge">--lm-order 4</code>)</p> + </li> + <li> + <p>Tune with MIRA instead of MERT (<code class="highlighter-rouge">--tuner mira</code>).
This requires that Moses is installed.</p> + </li> + <li> + <p>Decode with a wider beam (<code class="highlighter-rouge">--joshua-args '-pop-limit 200'</code>) (the default is 100)</p> + </li> + <li> + <p>Add the provided BN-EN dictionary to the training data (add another <code class="highlighter-rouge">--corpus</code> line, e.g., <code class="highlighter-rouge">--corpus $INDIAN/bn-en/dict.bn-en</code>)</p> + </li> +</ul> + +<p>To do this, we will create new runs that partially reuse the results of previous runs. This is +possible by doing three things: (1) incrementing the run directory and providing an updated README +note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) +providing the needed dependencies.</p> + +<h1 id="a-second-run">A second run</h1> + +<p>Let's begin by changing the tuner, to see what effect that has. To do so, we change the run +directory, tell the pipeline to start at the tuning step, and provide the needed dependencies:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+  --rundir 2 \
+  --readme "Tuning with MIRA" \
+  --source bn \
+  --target en \
+  --corpus $INDIAN/bn-en/tok/training.bn-en \
+  --tune $INDIAN/bn-en/tok/dev.bn-en \
+  --test $INDIAN/bn-en/tok/devtest.bn-en \
+  --first-step tune \
+  --tuner mira \
+  --grammar 1/grammar.gz \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+</code></pre> +</div> + +<p>Here, we have essentially the same invocation, but we have told the pipeline to use a different + tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs + to execute the tuning step.</p> + +<p>Note that we have also told it not to build a language model. This is necessary because the + pipeline always builds an LM on the target side of the training data, if provided, but we are + supplying the language model that was already built.
We could equivalently have removed the + <code class="highlighter-rouge">--corpus</code> line.</p> + +<h2 id="changing-the-model-type">Changing the model type</h2> + +<p>Let's compare the Hiero model we've already built to an SAMT model. We have to re-extract the +grammar, but can reuse the alignments and the language model:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+  --rundir 3 \
+  --readme "Baseline SAMT model" \
+  --source bn \
+  --target en \
+  --corpus $INDIAN/bn-en/tok/training.bn-en \
+  --tune $INDIAN/bn-en/tok/dev.bn-en \
+  --test $INDIAN/bn-en/tok/devtest.bn-en \
+  --alignment 1/alignments/training.align \
+  --first-step parse \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+</code></pre> +</div> + +<p>See <a href="pipeline.html#steps">the pipeline script page</a> for a list of all the steps.</p> + +<h2 id="analyzing-the-results">Analyzing the results</h2> + +<p>We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them +using the <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> script.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script 
src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tutorial.md ---------------------------------------------------------------------- diff --git a/5.0/tutorial.md b/5.0/tutorial.md deleted file mode 100644 index 038db9f..0000000 --- a/5.0/tutorial.md +++ /dev/null @@ -1,174 +0,0 @@ ---- -layout: default -category: links -title: Pipeline tutorial ---- - -This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a -sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of -other options available in the pipeline. - -## Download and Setup - -Download and install Joshua as described on the [quick start page](index.html), installing it under -`~/code/`. Once you've done that, you should make sure you have the following environment variables set: - - export JOSHUA=$HOME/code/joshua-v5.0 - export JAVA_HOME=/usr/java/default - -If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua -will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also -install Moses, and define the environment variable `$MOSES` to point to the root of its installation.
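A missing variable tends to surface late and confusingly in a long pipeline run, so it can help to verify the environment up front. A minimal sketch; the `check_set` helper is ours for illustration, not part of Joshua:

```shell
# check_set is a hypothetical helper, not part of Joshua: it succeeds
# only when the named environment variable is non-empty.
check_set() {
  eval "[ -n \"\${$1}\" ]"
}

export JOSHUA=$HOME/code/joshua-v5.0   # as exported above
check_set JOSHUA && echo "JOSHUA is set"
# $HADOOP and $MOSES are optional, so only warn if they are absent:
check_set MOSES || echo "note: MOSES is unset; kbmira tuning needs Moses"
```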
- -## A basic pipeline run - -For today's experiments, we'll be building a Bengali--English system using data included in the -[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking -the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's -[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is -not typically found in machine translation tutorials. - -Download the data and install it somewhere: - - cd ~/data - wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip - unzip indian-parallel-corpora.zip - -Then define the environment variable `$INDIAN` to point to it: - - cd ~/data/indian-parallel-corpora-master - export INDIAN=$(pwd) - -### Preparing the data - -Inside this tarball is a directory for each language pair. Within each language directory is another -directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was -done because the normalization scripts provided with Joshua are written in scripting languages that -often have problems properly handling UTF-8 character sets. We will be using these tokenized -versions, and preventing the pipeline from retokenizing them by passing the `--no-prepare` flag. - -In `$INDIAN/bn-en/tok`, you should see the following files: - - $ ls $INDIAN/bn-en/tok - dev.bn-en.bn      devtest.bn-en.bn    dict.bn-en.bn   test.bn-en.en.2 - dev.bn-en.en.0    devtest.bn-en.en.0  dict.bn-en.en   test.bn-en.en.3 - dev.bn-en.en.1    devtest.bn-en.en.1  test.bn-en.bn   training.bn-en.bn - dev.bn-en.en.2    devtest.bn-en.en.2  test.bn-en.en.0 training.bn-en.en - dev.bn-en.en.3    devtest.bn-en.en.3  test.bn-en.en.1 - -We will now use this data to test the complete pipeline with a single command.
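A corpus prefix like `training.bn-en` names a pair of files that the pipeline pairs up line by line, so both sides must have the same number of lines. A quick sanity check before a multi-hour run, sketched here against a tiny stand-in corpus (`/tmp/toy-tok` is ours; the real files live under `$INDIAN/bn-en/tok/`):

```shell
# Build a tiny stand-in parallel corpus; the real training files are
# $INDIAN/bn-en/tok/training.bn-en.{bn,en}.
mkdir -p /tmp/toy-tok
printf 'bn line 1\nbn line 2\nbn line 3\n' > /tmp/toy-tok/training.bn-en.bn
printf 'en line 1\nen line 2\nen line 3\n' > /tmp/toy-tok/training.bn-en.en

# The pipeline pairs the two sides by line, so the counts must agree.
src=$(wc -l < /tmp/toy-tok/training.bn-en.bn)
tgt=$(wc -l < /tmp/toy-tok/training.bn-en.en)
[ "$src" -eq "$tgt" ] && echo "parallel: $src sentence pairs"
```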
- -### Run the pipeline - -Create an experiments directory to contain your first experiment: - - mkdir ~/expts/joshua - cd ~/expts/joshua - -We will now create the baseline run, using a particular directory structure for experiments that -will allow us to take advantage of scripts provided with Joshua for displaying the results of many -related experiments. - - cd ~/expts/joshua - $JOSHUA/bin/pipeline.pl \ - --rundir 1 \ - --readme "Baseline Hiero run" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --corpus $INDIAN/bn-en/tok/dict.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --lm-order 3 - -This will start the pipeline building a Bengali--English translation system constructed from the -training data and a dictionary, tuned against dev, and tested against devtest. It will use the -default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment, -KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state -minimization for representing LM state in the decoder, and so on. We change the order of the n-gram -model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM. - -A few notes: - -- This will likely take many hours to run, especially if you don't have a Hadoop cluster. - -- If you are running on Mac OS X, KenLM's `lmplz` will not build due to the absence of static - libraries. In that case, you should add the flag `--lm-gen srilm` (recommended, if SRILM is - installed) or `--lm-gen berkeleylm`. - -### Variations - -Once that is finished, you will have a baseline model. From there, you might wish to try variations -of the baseline model.
Here are some examples of what you could vary: - -- Build an SAMT model (`--type samt`), GHKM model (`--type ghkm`), or phrasal ITG model (`--type phrasal`) - -- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`) - -- Build the language model with BerkeleyLM (`--lm-gen berkeleylm`) instead of KenLM (the default) - -- Change the order of the LM from the default of 5 (e.g., `--lm-order 4`) - -- Tune with MIRA instead of MERT (`--tuner mira`). This requires that Moses is installed. - -- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100) - -- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`) - -To do this, we will create new runs that partially reuse the results of previous runs. This is -possible by doing three things: (1) incrementing the run directory and providing an updated README -note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) -providing the needed dependencies. - -# A second run - -Let's begin by changing the tuner, to see what effect that has. To do so, we change the run -directory, tell the pipeline to start at the tuning step, and provide the needed dependencies: - - $JOSHUA/bin/pipeline.pl \ - --rundir 2 \ - --readme "Tuning with MIRA" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --first-step tune \ - --tuner mira \ - --grammar 1/grammar.gz \ - --no-corpus-lm \ - --lmfile 1/lm.gz - - Here, we have essentially the same invocation, but we have told the pipeline to use a different - tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs - to execute the tuning step. - - Note that we have also told it not to build a language model.
This is necessary because the - pipeline always builds an LM on the target side of the training data, if provided, but we are - supplying the language model that was already built. We could equivalently have removed the - `--corpus` line. - -## Changing the model type - -Let's compare the Hiero model we've already built to an SAMT model. We have to re-extract the -grammar, but can reuse the alignments and the language model: - - $JOSHUA/bin/pipeline.pl \ - --rundir 3 \ - --readme "Baseline SAMT model" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --alignment 1/alignments/training.align \ - --first-step parse \ - --no-corpus-lm \ - --lmfile 1/lm.gz - -See [the pipeline script page](pipeline.html#steps) for a list of all the steps. - -## Analyzing the results - -We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them -using the `$JOSHUA/scripts/training/summarize.pl` script.
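The reuse flags in runs 2 and 3 (`--grammar 1/grammar.gz`, `--lmfile 1/lm.gz`, `--alignment 1/alignments/training.align`) are paths relative to the experiments directory, which is how a later run points into an earlier one. A mock layout makes the convention concrete; the file names mirror the tutorial, but the `/tmp/expts` directory is ours for illustration:

```shell
# Mock up the artifacts that run 1 would leave behind, so the relative
# paths handed to later runs are concrete.
mkdir -p /tmp/expts/1/alignments
touch /tmp/expts/1/grammar.gz /tmp/expts/1/lm.gz
touch /tmp/expts/1/alignments/training.align

cd /tmp/expts
# pipeline.pl for runs 2 and 3 is invoked from here, so these are
# exactly the paths it would be given:
for dep in 1/grammar.gz 1/lm.gz 1/alignments/training.align; do
  [ -f "$dep" ] && echo "dependency present: $dep"
done
```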
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/zmert.html ---------------------------------------------------------------------- diff --git a/5.0/zmert.html b/5.0/zmert.html new file mode 100644 index 0000000..c32ec6e --- /dev/null +++ b/5.0/zmert.html @@ -0,0 +1,252 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Z-MERT</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" 
href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Z-MERT</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate +training module, written by Omar F. Zaidan.
It is easily adapted to drop in different decoders, and +was also written so as to work with different objective functions (other than BLEU).</p> + +<p>((Section (1) in <code class="highlighter-rouge">$JOSHUA/examples/ZMERT/README_ZMERT.txt</code> is an expanded version of this section))</p> + +<p>Z-MERT can be used by launching the driver program (<code class="highlighter-rouge">ZMERT.java</code>), which expects a config file as +its main argument. This config file can be used to specify any subset of Z-MERT's 20-some +parameters. For a full list of those parameters, and their default values, run ZMERT with a single +-h argument as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>java -cp $JOSHUA/bin joshua.zmert.ZMERT -h
+</code></pre> +</div> + +<p>So what does a Z-MERT config file look like?</p> + +<p>Examine the file <code class="highlighter-rouge">examples/ZMERT/ZMERT_config_ex2.txt</code>. You will find that it +specifies the following "main" MERT parameters:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>(*) -dir dirPrefix: working directory
+(*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset
+(*) -r refFile: target sentences (reference translations) of the MERT dataset
+(*) -rps refsPerSen: number of reference translations per sentence
+(*) -p paramsFile: file containing parameter names, initial values, and ranges
+(*) -maxIt maxMERTIts: maximum number of MERT iterations
+(*) -ipi initsPerIt: number of intermediate initial points per iteration
+(*) -cmd commandFile: name of file containing commands to run the decoder
+(*) -decOut decoderOutFile: name of the output file produced by the decoder
+(*) -dcfg decConfigFile: name of decoder config file
+(*) -N N: size of N-best list (per sentence) generated in each MERT iteration
+(*) -v verbosity: output verbosity level (0-2; higher value => more verbose)
+(*) -seed seed: seed used to initialize the random number generator
+</code></pre> 
+</div> + +<p>(Note that the <code class="highlighter-rouge">-s</code> parameter is only used if Z-MERT is running Joshua as an + internal decoder. If Joshua is run as an external decoder, as is the case in + this README, then this parameter is ignored.)</p> + +<p>To test Z-MERT on the 100-sentence test set of example2, provide this config +file to Z-MERT as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out
+</code></pre> +</div> + +<p>This will run Z-MERT for a couple of iterations on the data from the example2 +folder. (Notice that we have made copies of the source and reference files +from example2 and renamed them as src.txt and ref.* in the MERT_example folder, +just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run +is complete, you should be able to inspect the log file to see what kinds of +things it did. If everything goes well, the run should take a few minutes, of +which more than 95% is time spent by Z-MERT waiting on Joshua to finish +decoding the sentences (once per iteration).</p> + +<p>The output file you get should be equivalent to <code class="highlighter-rouge">ZMERT.out.verbosity1</code>. If you +rerun the experiment with the verbosity (-v) argument set to 2 instead of 1, +the output file you get should be equivalent to <code class="highlighter-rouge">ZMERT.out.verbosity2</code>, which has +more interesting details about what Z-MERT does.</p> + +<p>Notice the additional <code class="highlighter-rouge">-maxMem</code> argument. It tells Z-MERT that it should not +continue to use up memory while the decoder is running (during which time Z-MERT +would be idle). The 500 tells Z-MERT that it can use at most 500 MB. +For more details on this issue, see section (4) in Z-MERT's README.</p> + +<p>A quick note about Z-MERT's interaction with the decoder.
If you examine the +file <code class="highlighter-rouge">decoder_command_ex2.txt</code>, which is provided as the commandFile (<code class="highlighter-rouge">-cmd</code>) +argument in Z-MERT's config file, you'll find it contains the command one would +use to run the decoder. Z-MERT launches the commandFile as an external +process, and assumes that it will launch the decoder to produce translations. +(Make sure that commandFile is executable.) After launching this external +process, Z-MERT waits for it to finish, then uses the resulting output file for +parameter tuning (in addition to the output files from previous iterations). +The command file here only has a single command, but your command file could +have multiple lines.</p> + +<p>Notice that the Z-MERT arguments <code class="highlighter-rouge">configFile</code> and <code class="highlighter-rouge">decoderOutFile</code> (<code class="highlighter-rouge">-cfg</code> and +<code class="highlighter-rouge">-decOut</code>) must match the two Joshua arguments in the commandFile's (<code class="highlighter-rouge">-cmd</code>) single +command.
Also, the Z-MERT argument for N must match the value for <code class="highlighter-rouge">top_n</code> in +Joshua's config file, indicated by the Z-MERT argument configFile (<code class="highlighter-rouge">-cfg</code>).</p> + +<p>For more details on Z-MERT, refer to <code class="highlighter-rouge">$JOSHUA/examples/ZMERT/README_ZMERT.txt</code>.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> 
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/zmert.md ---------------------------------------------------------------------- diff --git a/5.0/zmert.md b/5.0/zmert.md deleted file mode 100644 index d6a5d3c..0000000 --- a/5.0/zmert.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -layout: default -category: advanced -title: Z-MERT ---- - -This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate -training module, written by Omar F. Zaidan. It is easily adapted to drop in different decoders, and -was also written so as to work with different objective functions (other than BLEU). - -((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section)) - -Z-MERT can be used by launching the driver program (`ZMERT.java`), which expects a config file as -its main argument. This config file can be used to specify any subset of Z-MERT's 20-some -parameters. For a full list of those parameters, and their default values, run ZMERT with a single --h argument as follows: - - java -cp $JOSHUA/bin joshua.zmert.ZMERT -h - -So what does a Z-MERT config file look like? - -Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`.
You will find that it -specifies the following "main" MERT parameters: - - (*) -dir dirPrefix: working directory - (*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset - (*) -r refFile: target sentences (reference translations) of the MERT dataset - (*) -rps refsPerSen: number of reference translations per sentence - (*) -p paramsFile: file containing parameter names, initial values, and ranges - (*) -maxIt maxMERTIts: maximum number of MERT iterations - (*) -ipi initsPerIt: number of intermediate initial points per iteration - (*) -cmd commandFile: name of file containing commands to run the decoder - (*) -decOut decoderOutFile: name of the output file produced by the decoder - (*) -dcfg decConfigFile: name of decoder config file - (*) -N N: size of N-best list (per sentence) generated in each MERT iteration - (*) -v verbosity: output verbosity level (0-2; higher value => more verbose) - (*) -seed seed: seed used to initialize the random number generator - -(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an - internal decoder. If Joshua is run as an external decoder, as is the case in - this README, then this parameter is ignored.) - -To test Z-MERT on the 100-sentence test set of example2, provide this config -file to Z-MERT as follows: - - java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out - -This will run Z-MERT for a couple of iterations on the data from the example2 -folder. (Notice that we have made copies of the source and reference files -from example2 and renamed them as src.txt and ref.* in the MERT_example folder, -just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run -is complete, you should be able to inspect the log file to see what kinds of -things it did. 
If everything goes well, the run should take a few minutes, of -which more than 95% is time spent by Z-MERT waiting on Joshua to finish -decoding the sentences (once per iteration). - -The output file you get should be equivalent to `ZMERT.out.verbosity1`. If you -rerun the experiment with the verbosity (-v) argument set to 2 instead of 1, -the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has -more interesting details about what Z-MERT does. - -Notice the additional `-maxMem` argument. It tells Z-MERT that it should not -continue to use up memory while the decoder is running (during which time Z-MERT -would be idle). The 500 tells Z-MERT that it can use at most 500 MB. -For more details on this issue, see section (4) in Z-MERT's README. - -A quick note about Z-MERT's interaction with the decoder. If you examine the -file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`) -argument in Z-MERT's config file, you'll find it contains the command one would -use to run the decoder. Z-MERT launches the commandFile as an external -process, and assumes that it will launch the decoder to produce translations. -(Make sure that commandFile is executable.) After launching this external -process, Z-MERT waits for it to finish, then uses the resulting output file for -parameter tuning (in addition to the output files from previous iterations). -The command file here only has a single command, but your command file could -have multiple lines. - -Notice that the Z-MERT arguments `configFile` and `decoderOutFile` (`-cfg` and -`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single -command. Also, the Z-MERT argument for N must match the value for `top_n` in -Joshua's config file, indicated by the Z-MERT argument configFile (`-cfg`). - -For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`.
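The decoder contract described above is simple: Z-MERT executes the command file, waits for it to exit, then reads the decoder output file. That contract can be sketched with a stand-in "decoder" that merely copies its input; all `/tmp` paths and the fake script are ours for illustration, not the files from example2:

```shell
# Write a stand-in command file. Z-MERT only requires that it be
# executable and that the decoder output file exists after it exits.
cat > /tmp/decoder_command <<'EOF'
#!/bin/sh
# Fake decoder: "translate" by copying the source sentences verbatim.
cp /tmp/src.txt /tmp/decoder.out
EOF
chmod +x /tmp/decoder_command

printf 'source sentence one\nsource sentence two\n' > /tmp/src.txt
/tmp/decoder_command        # what Z-MERT does once per iteration
wc -l < /tmp/decoder.out    # Z-MERT then reads this output file
```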
