http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/server.md ---------------------------------------------------------------------- diff --git a/5.0/server.md b/5.0/server.md deleted file mode 100644 index 52b2a66..0000000 --- a/5.0/server.md +++ /dev/null @@ -1,30 +0,0 @@ ---- -layout: default -category: links -title: Server mode ---- - -The Joshua decoder can be run as a TCP/IP server instead of a POSIX-style command-line tool. Clients can concurrently connect to a socket and receive a set of newline-separated outputs for a set of newline-separated inputs. - -Threading takes place both within and across requests. Threads from the decoder pool are assigned in a round-robin manner across requests, preventing starvation. - - -# Invoking the server - -A running server is configured at invocation time. To start in server mode, run `joshua-decoder` with the option `-server-port [PORT]`. Additionally, the server accepts the same configuration options as the command-line tool. - -E.g., - - $JOSHUA/bin/joshua-decoder -server-port 10101 -mark-oovs false -output-format "%s" -threads 10 - -## Using the server - -To test that the server is working, a set of inputs can be sent to the server from the command line. - -The server, as configured in the example above, will then respond to requests on port 10101. You can test it out with the `nc` utility: - - wget -qO - http://cs.jhu.edu/~post/files/pg1023.txt | head -132 | tail -11 | nc localhost 10101 - -Since no model was loaded, the server will simply echo back the text it was sent. - -The `-server-port` option can also be used when creating a [bundled configuration](bundle.html) that will be run in server mode.
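The newline-separated request/response protocol described above can also be exercised programmatically. The sketch below is illustrative, not part of Joshua: the `translate` helper name, the host/port defaults, and the assumption that the server replies line-for-line and closes the connection once the client shuts down its write side (mirroring the `nc` example) are all assumptions.

```python
import socket

def translate(sentences, host="localhost", port=10101):
    """Send newline-separated inputs to a Joshua server socket and return
    the newline-separated outputs as a list of strings.

    Hypothetical helper: Joshua only specifies the wire protocol
    (newline-separated in, newline-separated out), not this function.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(("\n".join(sentences) + "\n").encode("utf-8"))
        # Close the write side so the server knows the input is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: output is complete
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8").splitlines()
```

With no model loaded, the server behaves like the `nc` example above and simply echoes each input line back.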
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/thrax.html ---------------------------------------------------------------------- diff --git a/5.0/thrax.html b/5.0/thrax.html new file mode 100644 index 0000000..cbbfdee --- /dev/null +++ b/5.0/thrax.html @@ -0,0 +1,177 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Grammar extraction with Thrax</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" 
href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Grammar extraction with Thrax</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar +filtering, and details on the configuration file options. 
It will also include details about our +experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought +sweat and tears.</p> + +<p>In the meantime, please bother <a href="http://cs.jhu.edu/~jonny/">Jonny Weese</a> if there is something you +need to do that you don’t understand. You might also be able to dig up some information <a href="http://cs.jhu.edu/~jonny/thrax/">on the old +Thrax page</a>.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for 
Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/thrax.md ---------------------------------------------------------------------- diff --git a/5.0/thrax.md b/5.0/thrax.md deleted file mode 100644 index a904b23..0000000 --- a/5.0/thrax.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -layout: default -category: advanced -title: Grammar extraction with Thrax ---- - -One day, this will hold Thrax documentation, including how to use Thrax, how to do grammar -filtering, and details on the configuration file options. It will also include details about our -experience setting up and maintaining Hadoop cluster installations, knowledge wrought of hard-fought -sweat and tears. - -In the meantime, please bother [Jonny Weese](http://cs.jhu.edu/~jonny/) if there is something you -need to do that you don't understand. You might also be able to dig up some information [on the old -Thrax page](http://cs.jhu.edu/~jonny/thrax/). http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tms.html ---------------------------------------------------------------------- diff --git a/5.0/tms.html b/5.0/tms.html new file mode 100644 index 0000000..2138073 --- /dev/null +++ b/5.0/tms.html @@ -0,0 +1,290 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Building Translation Models</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + 
color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a 
href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Building Translation Models</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <h1 id="build-a-translation-model">Build a translation model</h1> + +<p>Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences.</p> + +<p>We will copy (or symlink) the parallel source text files in a subdirectory called <code class="highlighter-rouge">input/</code>.</p> + +<p>Then, we concatenate all the training files on each side. 
The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the <code class="highlighter-rouge">pipeline.pl</code> option <code class="highlighter-rouge">--first-step alignment</code>.</p> + +<p>To tokenize the English data, do</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cat callhome.en europarl.en fisher.en | tee all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en +</code></pre> +</div> + +<p>The same can be done for the Spanish side of the input data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cat callhome.es europarl.es fisher.es | tee all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es +</code></pre> +</div> + +<p>By the way, an alternative tokenizer is a Twitter tokenizer found in the <a href="http://github.com/vandurme/jerboa">Jerboa</a> project.</p> + +<p>The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ + | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en +</code></pre> +</div> + +<p>Contents of <code class="highlighter-rouge">splittabs.pl</code> by Matt Post:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="c1">#!/usr/bin/perl</span> + +<span class="c1"># splits on tab, printing respective chunks to the list of files given</span> +<span class="c1"># as script arguments</span> + +<span class="k">use</span> <span 
class="nv">FileHandle</span><span class="p">;</span> + +<span class="k">my</span> <span class="nv">@fh</span><span class="p">;</span> +<span class="vg">$|</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1"># don't buffer output</span> + +<span class="k">if</span> <span class="p">(</span><span class="nv">@ARGV</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> + <span class="k">print</span> <span class="s">"Usage: splittabs.pl &lt; tabbed-file\n"</span><span class="p">;</span> + <span class="nb">exit</span><span class="p">;</span> +<span class="p">}</span> + +<span class="k">my</span> <span class="nv">@fh</span> <span class="o">=</span> <span class="nb">map</span> <span class="p">{</span> <span class="nv">get_filehandle</span><span class="p">(</span><span class="nv">$_</span><span class="p">)</span> <span class="p">}</span> <span class="nv">@ARGV</span><span class="p">;</span> +<span class="nv">@ARGV</span> <span class="o">=</span> <span class="p">();</span> + +<span class="k">while</span> <span class="p">(</span><span class="k">my</span> <span class="nv">$line</span> <span class="o">=</span> <span class="o">&lt;&gt;</span><span class="p">)</span> <span class="p">{</span> + <span class="nb">chomp</span><span class="p">(</span><span class="nv">$line</span><span class="p">);</span> + <span class="k">my</span> <span class="p">(</span><span class="nv">@fields</span><span class="p">)</span> <span class="o">=</span> <span class="nb">split</span><span class="p">(</span><span class="sr">/\t/</span><span class="p">,</span><span class="nv">$line</span><span class="p">,</span><span class="nb">scalar</span> <span class="nv">@fh</span><span class="p">);</span> + + <span class="nb">map</span> <span class="p">{</span> <span class="k">print</span> <span class="p">{</span><span class="nv">$fh</span><span class="p">[</span><span class="nv">$_</span><span class="p">]}</span> <span 
class="s">"$fields[$_]\n"</span> <span class="p">}</span> <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="nv">$#fields</span><span class="p">);</span> +<span class="p">}</span> + +<span class="k">sub </span><span class="nf">get_filehandle</span> <span class="p">{</span> + <span class="k">my</span> <span class="nv">$file</span> <span class="o">=</span> <span class="nb">shift</span><span class="p">;</span> + + <span class="k">if</span> <span class="p">(</span><span class="nv">$file</span> <span class="ow">eq</span> <span class="s">"-"</span><span class="p">)</span> <span class="p">{</span> + <span class="k">return</span> <span class="o">*</span><span class="bp">STDOUT</span><span class="p">;</span> + <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> + <span class="nb">local</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span> + <span class="nb">open</span> <span class="nv">FH</span><span class="p">,</span> <span class="s">">$file"</span> <span class="ow">or</span> <span class="nb">die</span> <span class="s">"can't open '$file' for writing"</span><span class="p">;</span> + <span class="k">return</span> <span class="o">*</span><span class="nv">FH</span><span class="p">;</span> + <span class="p">}</span> +<span class="p">}</span> +</code></pre> +</div> + +<p>Now we can run the pipeline to extract the grammar. Run the following script:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="c">#!/bin/bash</span> + +<span class="c"># this creates a grammar</span> + +<span class="c"># NEED:</span> +<span class="c"># pair</span> +<span class="c"># type</span> + +<span class="nb">set</span> -u + +<span class="nv">pair</span><span class="o">=</span>es-en +<span class="nb">type</span><span class="o">=</span>hiero + +<span class="c">#. 
~/.bashrc</span> + +<span class="c">#basedir=$(pwd)</span> + +<span class="nv">dir</span><span class="o">=</span>grammar-<span class="nv">$pair</span>-<span class="nv">$type</span> + +<span class="o">[[</span> ! -d <span class="nv">$dir</span> <span class="o">]]</span> <span class="o">&&</span> mkdir -p <span class="nv">$dir</span> +<span class="nb">cd</span> <span class="nv">$dir</span> + +<span class="nb">source</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 1<span class="k">)</span> +<span class="nv">target</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span> <span class="nv">$pair</span> | cut -d- -f 2<span class="k">)</span> + +<span class="nv">$JOSHUA</span>/scripts/training/pipeline.pl <span class="se">\</span> + --source <span class="nv">$source</span> <span class="se">\</span> + --target <span class="nv">$target</span> <span class="se">\</span> + --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks <span class="se">\</span> + --type <span class="nv">$type</span> <span class="se">\</span> + --joshua-mem 100g <span class="se">\</span> + --no-prepare <span class="se">\</span> + --first-step align <span class="se">\</span> + --last-step thrax <span class="se">\</span> + --hadoop <span class="nv">$HADOOP</span> <span class="se">\</span> + --threads 8 <span class="se">\</span> +</code></pre> +</div> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script 
src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tms.md ---------------------------------------------------------------------- diff --git a/5.0/tms.md b/5.0/tms.md deleted file mode 100644 index 68f8732..0000000 --- a/5.0/tms.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -layout: default -category: advanced -title: Building Translation Models ---- - -# Build a translation model - -Extracting a grammar from a large amount of data is a multi-step process. The first requirement is parallel data. The Europarl, Call Home, and Fisher corpora all contain parallel translations of Spanish and English sentences. - -We will copy (or symlink) the parallel source text files in a subdirectory called `input/`. - -Then, we concatenate all the training files on each side. 
The pipeline script normally does tokenization and normalization, but in this instance we have a custom tokenizer we need to apply to the source side, so we have to do it manually and then skip that step using the `pipeline.pl` option `--first-step alignment`. - -* to tokenize the English data, do - - cat callhome.en europarl.en fisher.en | tee all.en | $JOSHUA/scripts/training/normalize-punctuation.pl en | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.en - -The same can be done for the Spanish side of the input data: - - cat callhome.es europarl.es fisher.es | tee all.es | $JOSHUA/scripts/training/normalize-punctuation.pl es | $JOSHUA/scripts/training/penn-treebank-tokenizer.perl | $JOSHUA/scripts/lowercase.perl > all.norm.tok.lc.es - -By the way, an alternative tokenizer is a Twitter tokenizer found in the [Jerboa](http://github.com/vandurme/jerboa) project. - -The final step in the training data preparation is to remove all examples in which either of the language sides is a blank line. - - paste all.norm.tok.lc.es all.norm.tok.lc.en | grep -Pv "^\t|\t$" \ - | ./splittabs.pl all.norm.tok.lc.noblanks.es all.norm.tok.lc.noblanks.en - -contents of `splittabs.pl` by Matt Post: - - #!/usr/bin/perl - - # splits on tab, printing respective chunks to the list of files given - # as script arguments - - use FileHandle; - - my @fh; - $| = 1; # don't buffer output - - if (@ARGV == 0) { - print "Usage: splittabs.pl < tabbed-file\n"; - exit; - } - - my @fh = map { get_filehandle($_) } @ARGV; - @ARGV = (); - - while (my $line = <>) { - chomp($line); - my (@fields) = split(/\t/,$line,scalar @fh); - - map { print {$fh[$_]} "$fields[$_]\n" } (0..$#fields); - } - - sub get_filehandle { - my $file = shift; - - if ($file eq "-") { - return *STDOUT; - } else { - local *FH; - open FH, ">$file" or die "can't open '$file' for writing"; - return *FH; - } - } - -Now we can run the pipeline to extract the grammar. 
Run the following script: - - #!/bin/bash - - # this creates a grammar - - # NEED: - # pair - # type - - set -u - - pair=es-en - type=hiero - - #. ~/.bashrc - - #basedir=$(pwd) - - dir=grammar-$pair-$type - - [[ ! -d $dir ]] && mkdir -p $dir - cd $dir - - source=$(echo $pair | cut -d- -f 1) - target=$(echo $pair | cut -d- -f 2) - - $JOSHUA/scripts/training/pipeline.pl \ - --source $source \ - --target $target \ - --corpus /home/hltcoe/lorland/expts/scale12/model1/input/all.norm.tok.lc.noblanks \ - --type $type \ - --joshua-mem 100g \ - --no-prepare \ - --first-step align \ - --last-step thrax \ - --hadoop $HADOOP \ - --threads 8 \ http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tutorial.html ---------------------------------------------------------------------- diff --git a/5.0/tutorial.html b/5.0/tutorial.html new file mode 100644 index 0000000..59c0c40 --- /dev/null +++ b/5.0/tutorial.html @@ -0,0 +1,368 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Pipeline tutorial</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; 
+ padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Pipeline tutorial</h2> 
+ <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This document will walk you through using the pipeline in a variety of scenarios. Once you’ve gained a +sense for how the pipeline works, you can consult the <a href="pipeline.html">pipeline page</a> for a number of +other options available in the pipeline.</p> + +<h2 id="download-and-setup">Download and Setup</h2> + +<p>Download and install Joshua as described on the <a href="index.html">quick start page</a>, installing it under +<code class="highlighter-rouge">~/code/</code>. Once you’ve done that, you should make sure you have the following environment variables set:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>export JOSHUA=$HOME/code/joshua-v5.0 +export JAVA_HOME=/usr/java/default +</code></pre> +</div> + +<p>If you have a Hadoop installation, make sure you’ve set <code class="highlighter-rouge">$HADOOP</code> to point to it (if not, Joshua +will roll out a standalone cluster for you). If you’d like to use kbmira for tuning, you should also +install Moses, and define the environment variable <code class="highlighter-rouge">$MOSES</code> to point to the root of its installation.</p> + +<h2 id="a-basic-pipeline-run">A basic pipeline run</h2> + +<p>For today’s experiments, we’ll be building a Bengali–English system using data included in the +<a href="/indian-parallel-corpora/">Indian Languages Parallel Corpora</a>. This data was collected by taking +the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon’s +<a href="http://www.mturk.com/">Mechanical Turk</a>. 
As a warning, many of these pages contain material that is +not typically found in machine translation tutorials.</p> + +<p>Download the data and install it somewhere:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data +wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip +unzip indian-parallel-corpora.zip +</code></pre> +</div> + +<p>Then define the environment variable <code class="highlighter-rouge">$INDIAN</code> to point to it:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/data/indian-parallel-corpora-master +export INDIAN=$(pwd) +</code></pre> +</div> + +<h3 id="preparing-the-data">Preparing the data</h3> + +<p>Inside this archive is a directory for each language pair. Within each language directory is another +directory named <code class="highlighter-rouge">tok/</code>, which contains pre-tokenized and normalized versions of the data. This was +done because the normalization scripts provided with Joshua are written in scripting languages that +often have problems properly handling UTF-8 character sets. 
We will be using these tokenized +versions, and will prevent the pipeline from retokenizing them by using the <code class="highlighter-rouge">--no-prepare</code> flag.</p> + +<p>In <code class="highlighter-rouge">$INDIAN/bn-en/tok</code>, you should see the following files:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$ ls $INDIAN/bn-en/tok +dev.bn-en.bn devtest.bn-en.bn dict.bn-en.bn test.bn-en.en.2 +dev.bn-en.en.0 devtest.bn-en.en.0 dict.bn-en.en test.bn-en.en.3 +dev.bn-en.en.1 devtest.bn-en.en.1 test.bn-en.bn training.bn-en.bn +dev.bn-en.en.2 devtest.bn-en.en.2 test.bn-en.en.0 training.bn-en.en +dev.bn-en.en.3 devtest.bn-en.en.3 test.bn-en.en.1 +</code></pre> +</div> + +<p>We will now use this data to test the complete pipeline with a single command.</p> + +<h3 id="run-the-pipeline">Run the pipeline</h3> + +<p>Create an experiments directory to contain your first experiment:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>mkdir -p ~/expts/joshua +cd ~/expts/joshua +</code></pre> +</div> + +<p>We will now create the baseline run, using a particular directory structure for experiments that +will allow us to take advantage of scripts provided with Joshua for displaying the results of many +related experiments.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>cd ~/expts/joshua +$JOSHUA/bin/pipeline.pl \ + --rundir 1 \ + --readme "Baseline Hiero run" \ + --source bn \ + --target en \ + --corpus $INDIAN/bn-en/tok/training.bn-en \ + --corpus $INDIAN/bn-en/tok/dict.bn-en \ + --tune $INDIAN/bn-en/tok/dev.bn-en \ + --test $INDIAN/bn-en/tok/devtest.bn-en \ + --lm-order 3 +</code></pre> +</div> + +<p>This will start the pipeline building a Bengali–English translation system constructed from the +training data and a dictionary, tuned against dev, and tested against devtest. 
It will use the +default values for most of the pipeline: <a href="https://code.google.com/p/giza-pp/">GIZA++</a> for alignment, +KenLM's <code class="highlighter-rouge">lmplz</code> for building the language model, Z-MERT for tuning, KenLM with left-state +minimization for representing LM state in the decoder, and so on. We change the order of the n-gram +model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.</p> + +<p>A few notes:</p> + +<ul> + <li> + <p>This will likely take many hours to run, especially if you don't have a Hadoop cluster.</p> + </li> + <li> + <p>If you are running on Mac OS X, KenLM's <code class="highlighter-rouge">lmplz</code> will not build due to the absence of static +libraries. In that case, you should add the flag <code class="highlighter-rouge">--lm-gen srilm</code> (recommended, if SRILM is +installed) or <code class="highlighter-rouge">--lm-gen berkeleylm</code>.</p> + </li> +</ul> + +<h3 id="variations">Variations</h3> + +<p>Once that is finished, you will have a baseline model. From there, you might wish to try variations +of the baseline model. Here are some examples of what you could vary:</p> + +<ul> + <li> + <p>Build an SAMT model (<code class="highlighter-rouge">--type samt</code>), GHKM model (<code class="highlighter-rouge">--type ghkm</code>), or phrasal ITG model (<code class="highlighter-rouge">--type phrasal</code>)</p> + </li> + <li> + <p>Use the Berkeley aligner instead of GIZA++ (<code class="highlighter-rouge">--aligner berkeley</code>)</p> + </li> + <li> + <p>Build the language model with BerkeleyLM (<code class="highlighter-rouge">--lm-gen berkeleylm</code>) instead of KenLM (the default)</p> + </li> + <li> + <p>Change the order of the LM from the default of 5 (e.g., <code class="highlighter-rouge">--lm-order 4</code>)</p> + </li> + <li> + <p>Tune with MIRA instead of MERT (<code class="highlighter-rouge">--tuner mira</code>).
This requires that Moses is installed.</p> + </li> + <li> + <p>Decode with a wider beam (<code class="highlighter-rouge">--joshua-args '-pop-limit 200'</code>) (the default is 100)</p> + </li> + <li> + <p>Add the provided BN-EN dictionary to the training data (add another <code class="highlighter-rouge">--corpus</code> line, e.g., <code class="highlighter-rouge">--corpus $INDIAN/bn-en/dict.bn-en</code>)</p> + </li> +</ul> + +<p>To do this, we will create new runs that partially reuse the results of previous runs. This is +possible by doing three things: (1) incrementing the run directory and providing an updated README +note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) +providing the needed dependencies.</p> + +<h1 id="a-second-run">A second run</h1> + +<p>Let's begin by changing the tuner, to see what effect that has. To do so, we change the run +directory, tell the pipeline to start at the tuning step, and provide the needed dependencies:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+  --rundir 2 \
+  --readme "Tuning with MIRA" \
+  --source bn \
+  --target en \
+  --corpus $INDIAN/bn-en/tok/training.bn-en \
+  --tune $INDIAN/bn-en/tok/dev.bn-en \
+  --test $INDIAN/bn-en/tok/devtest.bn-en \
+  --first-step tune \
+  --tuner mira \
+  --grammar 1/grammar.gz \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+</code></pre> +</div> + +<p>Here, we have essentially the same invocation, but we have told the pipeline to use a different + tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs + to execute the tuning step.</p> + +<p>Note that we have also told it not to build a language model. This is necessary because the + pipeline always builds an LM on the target side of the training data, if provided, but we are + supplying the language model that was already built.
We could equivalently have removed the + <code class="highlighter-rouge">--corpus</code> line.</p> + +<h2 id="changing-the-model-type">Changing the model type</h2> + +<p>Let's compare the Hiero model we've already built to an SAMT model. We have to re-extract the +grammar, but can reuse the alignments and the language model:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$JOSHUA/bin/pipeline.pl \
+  --rundir 3 \
+  --readme "Baseline SAMT model" \
+  --source bn \
+  --target en \
+  --corpus $INDIAN/bn-en/tok/training.bn-en \
+  --tune $INDIAN/bn-en/tok/dev.bn-en \
+  --test $INDIAN/bn-en/tok/devtest.bn-en \
+  --alignment 1/alignments/training.align \
+  --first-step parse \
+  --no-corpus-lm \
+  --lmfile 1/lm.gz
+</code></pre> +</div> + +<p>See <a href="pipeline.html#steps">the pipeline script page</a> for a list of all the steps.</p> + +<h2 id="analyzing-the-results">Analyzing the results</h2> + +<p>We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them +using the <code class="highlighter-rouge">$JOSHUA/scripts/training/summarize.pl</code> script.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script 
src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/tutorial.md ---------------------------------------------------------------------- diff --git a/5.0/tutorial.md b/5.0/tutorial.md deleted file mode 100644 index 038db9f..0000000 --- a/5.0/tutorial.md +++ /dev/null @@ -1,174 +0,0 @@ ---- -layout: default -category: links -title: Pipeline tutorial ---- - -This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a -sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of -other options available in the pipeline. - -## Download and Setup - -Download and install Joshua as described on the [quick start page](index.html), installing it under -`~/code/`. Once you've done that, you should make sure you have the following environment variables set: - - export JOSHUA=$HOME/code/joshua-v5.0 - export JAVA_HOME=/usr/java/default - -If you have a Hadoop installation, make sure you've set `$HADOOP` to point to it (if not, Joshua -will roll out a standalone cluster for you). If you'd like to use kbmira for tuning, you should also -install Moses, and define the environment variable `$MOSES` to point to the root of its installation.
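A missing variable tends to surface late and confusingly in a long pipeline run, so it can help to verify the environment up front. A minimal sketch; the `check_set` helper is ours for illustration, not part of Joshua:

```shell
# check_set is a hypothetical helper, not part of Joshua: it succeeds
# only when the named environment variable is non-empty.
check_set() {
  eval "[ -n \"\${$1}\" ]"
}

export JOSHUA=$HOME/code/joshua-v5.0   # as exported above
check_set JOSHUA && echo "JOSHUA is set"
# $HADOOP and $MOSES are optional, so only warn if they are absent:
check_set MOSES || echo "note: MOSES is unset; kbmira tuning needs Moses"
```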
- -## A basic pipeline run - -For today's experiments, we'll be building a Bengali--English system using data included in the -[Indian Languages Parallel Corpora](/indian-parallel-corpora/). This data was collected by taking -the 100 most-popular Bengali Wikipedia pages and translating them into English using Amazon's -[Mechanical Turk](http://www.mturk.com/). As a warning, many of these pages contain material that is -not typically found in machine translation tutorials. - -Download the data and install it somewhere: - - cd ~/data - wget -q --no-check -O indian-parallel-corpora.zip https://github.com/joshua-decoder/indian-parallel-corpora/archive/master.zip - unzip indian-parallel-corpora.zip - -Then define the environment variable `$INDIAN` to point to it: - - cd ~/data/indian-parallel-corpora-master - export INDIAN=$(pwd) - -### Preparing the data - -Inside this tarball is a directory for each language pair. Within each language directory is another -directory named `tok/`, which contains pre-tokenized and normalized versions of the data. This was -done because the normalization scripts provided with Joshua are written in scripting languages that -often have problems properly handling UTF-8 character sets. We will be using these tokenized -versions, and preventing the pipeline from retokenizing them by passing the `--no-prepare` flag. - -In `$INDIAN/bn-en/tok`, you should see the following files: - - $ ls $INDIAN/bn-en/tok - dev.bn-en.bn      devtest.bn-en.bn    dict.bn-en.bn   test.bn-en.en.2 - dev.bn-en.en.0    devtest.bn-en.en.0  dict.bn-en.en   test.bn-en.en.3 - dev.bn-en.en.1    devtest.bn-en.en.1  test.bn-en.bn   training.bn-en.bn - dev.bn-en.en.2    devtest.bn-en.en.2  test.bn-en.en.0 training.bn-en.en - dev.bn-en.en.3    devtest.bn-en.en.3  test.bn-en.en.1 - -We will now use this data to test the complete pipeline with a single command.
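A corpus prefix like `training.bn-en` names a pair of files that the pipeline pairs up line by line, so both sides must have the same number of lines. A quick sanity check before a multi-hour run, sketched here against a tiny stand-in corpus (`/tmp/toy-tok` is ours; the real files live under `$INDIAN/bn-en/tok/`):

```shell
# Build a tiny stand-in parallel corpus; the real training files are
# $INDIAN/bn-en/tok/training.bn-en.{bn,en}.
mkdir -p /tmp/toy-tok
printf 'bn line 1\nbn line 2\nbn line 3\n' > /tmp/toy-tok/training.bn-en.bn
printf 'en line 1\nen line 2\nen line 3\n' > /tmp/toy-tok/training.bn-en.en

# The pipeline pairs the two sides by line, so the counts must agree.
src=$(wc -l < /tmp/toy-tok/training.bn-en.bn)
tgt=$(wc -l < /tmp/toy-tok/training.bn-en.en)
[ "$src" -eq "$tgt" ] && echo "parallel: $src sentence pairs"
```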
- -### Run the pipeline - -Create an experiments directory to contain your first experiment: - - mkdir ~/expts/joshua - cd ~/expts/joshua - -We will now create the baseline run, using a particular directory structure for experiments that -will allow us to take advantage of scripts provided with Joshua for displaying the results of many -related experiments. - - cd ~/expts/joshua - $JOSHUA/bin/pipeline.pl \ - --rundir 1 \ - --readme "Baseline Hiero run" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --corpus $INDIAN/bn-en/tok/dict.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --lm-order 3 - -This will start the pipeline building a Bengali--English translation system constructed from the -training data and a dictionary, tuned against dev, and tested against devtest. It will use the -default values for most of the pipeline: [GIZA++](https://code.google.com/p/giza-pp/) for alignment, -KenLM's `lmplz` for building the language model, Z-MERT for tuning, KenLM with left-state -minimization for representing LM state in the decoder, and so on. We change the order of the n-gram -model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM. - -A few notes: - -- This will likely take many hours to run, especially if you don't have a Hadoop cluster. - -- If you are running on Mac OS X, KenLM's `lmplz` will not build due to the absence of static - libraries. In that case, you should add the flag `--lm-gen srilm` (recommended, if SRILM is - installed) or `--lm-gen berkeleylm`. - -### Variations - -Once that is finished, you will have a baseline model. From there, you might wish to try variations -of the baseline model.
Here are some examples of what you could vary: - -- Build an SAMT model (`--type samt`), GHKM model (`--type ghkm`), or phrasal ITG model (`--type phrasal`) - -- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`) - -- Build the language model with BerkeleyLM (`--lm-gen berkeleylm`) instead of KenLM (the default) - -- Change the order of the LM from the default of 5 (e.g., `--lm-order 4`) - -- Tune with MIRA instead of MERT (`--tuner mira`). This requires that Moses is installed. - -- Decode with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100) - -- Add the provided BN-EN dictionary to the training data (add another `--corpus` line, e.g., `--corpus $INDIAN/bn-en/dict.bn-en`) - -To do this, we will create new runs that partially reuse the results of previous runs. This is -possible by doing three things: (1) incrementing the run directory and providing an updated README -note; (2) telling the pipeline which of the many steps of the pipeline to begin at; and (3) -providing the needed dependencies. - -# A second run - -Let's begin by changing the tuner, to see what effect that has. To do so, we change the run -directory, tell the pipeline to start at the tuning step, and provide the needed dependencies: - - $JOSHUA/bin/pipeline.pl \ - --rundir 2 \ - --readme "Tuning with MIRA" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --first-step tune \ - --tuner mira \ - --grammar 1/grammar.gz \ - --no-corpus-lm \ - --lmfile 1/lm.gz - - Here, we have essentially the same invocation, but we have told the pipeline to use a different - tuner (MIRA), to start with tuning, and have provided it with the language model file and grammar it needs - to execute the tuning step. - - Note that we have also told it not to build a language model.
This is necessary because the - pipeline always builds an LM on the target side of the training data, if provided, but we are - supplying the language model that was already built. We could equivalently have removed the - `--corpus` line. - -## Changing the model type - -Let's compare the Hiero model we've already built to an SAMT model. We have to re-extract the -grammar, but can reuse the alignments and the language model: - - $JOSHUA/bin/pipeline.pl \ - --rundir 3 \ - --readme "Baseline SAMT model" \ - --source bn \ - --target en \ - --corpus $INDIAN/bn-en/tok/training.bn-en \ - --tune $INDIAN/bn-en/tok/dev.bn-en \ - --test $INDIAN/bn-en/tok/devtest.bn-en \ - --alignment 1/alignments/training.align \ - --first-step parse \ - --no-corpus-lm \ - --lmfile 1/lm.gz - -See [the pipeline script page](pipeline.html#steps) for a list of all the steps. - -## Analyzing the results - -We now have three runs, in subdirectories 1, 2, and 3. We can display summary results from them -using the `$JOSHUA/scripts/training/summarize.pl` script.
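The reuse flags in runs 2 and 3 (`--grammar 1/grammar.gz`, `--lmfile 1/lm.gz`, `--alignment 1/alignments/training.align`) are paths relative to the experiments directory, which is how a later run points into an earlier one. A mock layout makes the convention concrete; the file names mirror the tutorial, but the `/tmp/expts` directory is ours for illustration:

```shell
# Mock up the artifacts that run 1 would leave behind, so the relative
# paths handed to later runs are concrete.
mkdir -p /tmp/expts/1/alignments
touch /tmp/expts/1/grammar.gz /tmp/expts/1/lm.gz
touch /tmp/expts/1/alignments/training.align

cd /tmp/expts
# pipeline.pl for runs 2 and 3 is invoked from here, so these are
# exactly the paths it would be given:
for dep in 1/grammar.gz 1/lm.gz 1/alignments/training.align; do
  [ -f "$dep" ] && echo "dependency present: $dep"
done
```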
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/zmert.html ---------------------------------------------------------------------- diff --git a/5.0/zmert.html b/5.0/zmert.html new file mode 100644 index 0000000..c32ec6e --- /dev/null +++ b/5.0/zmert.html @@ -0,0 +1,252 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <title>Joshua Documentation | Z-MERT</title> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <meta name="description" content=""> + <meta name="author" content=""> + + <!-- Le styles --> + <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> + <style> + body { + padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ + } + #download { + background-color: green; + font-size: 14pt; + font-weight: bold; + text-align: center; + color: white; + border-radius: 5px; + padding: 4px; + } + + #download a:link { + color: white; + } + + #download a:hover { + color: lightgrey; + } + + #download a:visited { + color: white; + } + + a.pdf { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: brown; + padding: 2px; + } + + a.bibtex { + font-variant: small-caps; + /* font-weight: bold; */ + font-size: 10pt; + color: white; + background: orange; + padding: 2px; + } + + img.sponsor { + height: 120px; + margin: 5px; + } + </style> + <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> + + <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> + <!--[if lt IE 9]> + <script src="bootstrap/js/html5shiv.js"></script> + <![endif]--> + + <!-- Fav and touch icons --> + <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> + <link rel="apple-touch-icon-precomposed" sizes="72x72" 
href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> + <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> + <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> + </head> + + <body> + + <div class="navbar navbar-inverse navbar-fixed-top"> + <div class="navbar-inner"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="brand" href="/">Joshua</a> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="index.html">Documentation</a></li> + <li><a href="pipeline.html">Pipeline</a></li> + <li><a href="tutorial.html">Tutorial</a></li> + <li><a href="decoder.html">Decoder</a></li> + <li><a href="thrax.html">Thrax</a></li> + <li><a href="file-formats.html">File formats</a></li> + <!-- <li><a href="advanced.html">Advanced</a></li> --> + <li><a href="faq.html">FAQ</a></li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + + <div class="container"> + + <div class="row"> + <div class="span2"> + <img src="/images/joshua-logo-small.png" + alt="Joshua logo (picture of a Joshua tree)" /> + </div> + <div class="span10"> + <h1>Joshua Documentation</h1> + <h2>Z-MERT</h2> + <span id="download"> + <a href="http://cs.jhu.edu/~post/files/joshua-v5.0.tgz">Download</a> + </span> + (version 5.0, released 16 August 2013) + </div> + </div> + + <hr /> + + <div class="row"> + <div class="span8"> + + <p>This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate +training module, written by Omar F. Zaidan.
It is easily adapted to drop in different decoders, and +was also written so as to work with different objective functions (other than BLEU).</p> + +<p>((Section (1) in <code class="highlighter-rouge">$JOSHUA/examples/ZMERT/README_ZMERT.txt</code> is an expanded version of this section))</p> + +<p>Z-MERT can be used by launching the driver program (<code class="highlighter-rouge">ZMERT.java</code>), which expects a config file as +its main argument. This config file can be used to specify any subset of Z-MERT's 20-some +parameters. For a full list of those parameters, and their default values, run ZMERT with a single +-h argument as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>java -cp $JOSHUA/bin joshua.zmert.ZMERT -h
+</code></pre> +</div> + +<p>So what does a Z-MERT config file look like?</p> + +<p>Examine the file <code class="highlighter-rouge">examples/ZMERT/ZMERT_config_ex2.txt</code>. You will find that it +specifies the following "main" MERT parameters:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>(*) -dir dirPrefix: working directory
+(*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset
+(*) -r refFile: target sentences (reference translations) of the MERT dataset
+(*) -rps refsPerSen: number of reference translations per sentence
+(*) -p paramsFile: file containing parameter names, initial values, and ranges
+(*) -maxIt maxMERTIts: maximum number of MERT iterations
+(*) -ipi initsPerIt: number of intermediate initial points per iteration
+(*) -cmd commandFile: name of file containing commands to run the decoder
+(*) -decOut decoderOutFile: name of the output file produced by the decoder
+(*) -dcfg decConfigFile: name of decoder config file
+(*) -N N: size of N-best list (per sentence) generated in each MERT iteration
+(*) -v verbosity: output verbosity level (0-2; higher value => more verbose)
+(*) -seed seed: seed used to initialize the random number generator
+</code></pre> 
+</div> + +<p>(Note that the <code class="highlighter-rouge">-s</code> parameter is only used if Z-MERT is running Joshua as an + internal decoder. If Joshua is run as an external decoder, as is the case in + this README, then this parameter is ignored.)</p> + +<p>To test Z-MERT on the 100-sentence test set of example2, provide this config +file to Z-MERT as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out
+</code></pre> +</div> + +<p>This will run Z-MERT for a couple of iterations on the data from the example2 +folder. (Notice that we have made copies of the source and reference files +from example2 and renamed them as src.txt and ref.* in the MERT_example folder, +just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run +is complete, you should be able to inspect the log file to see what kinds of +things it did. If everything goes well, the run should take a few minutes, of +which more than 95% is time spent by Z-MERT waiting on Joshua to finish +decoding the sentences (once per iteration).</p> + +<p>The output file you get should be equivalent to <code class="highlighter-rouge">ZMERT.out.verbosity1</code>. If you +rerun the experiment with the verbosity (-v) argument set to 2 instead of 1, +the output file you get should be equivalent to <code class="highlighter-rouge">ZMERT.out.verbosity2</code>, which has +more interesting details about what Z-MERT does.</p> + +<p>Notice the additional <code class="highlighter-rouge">-maxMem</code> argument. It tells Z-MERT that it should not +continue to use up memory while the decoder is running (during which time Z-MERT +would be idle). The 500 tells Z-MERT that it can use at most 500 MB. +For more details on this issue, see section (4) in Z-MERT's README.</p> + +<p>A quick note about Z-MERT's interaction with the decoder.
If you examine the +file <code class="highlighter-rouge">decoder_command_ex2.txt</code>, which is provided as the commandFile (<code class="highlighter-rouge">-cmd</code>) +argument in Z-MERT's config file, you'll find it contains the command one would +use to run the decoder. Z-MERT launches the commandFile as an external +process, and assumes that it will launch the decoder to produce translations. +(Make sure that commandFile is executable.) After launching this external +process, Z-MERT waits for it to finish, then uses the resulting output file for +parameter tuning (in addition to the output files from previous iterations). +The command file here only has a single command, but your command file could +have multiple lines.</p> + +<p>Notice that the Z-MERT arguments <code class="highlighter-rouge">configFile</code> and <code class="highlighter-rouge">decoderOutFile</code> (<code class="highlighter-rouge">-cfg</code> and +<code class="highlighter-rouge">-decOut</code>) must match the two Joshua arguments in the commandFile's (<code class="highlighter-rouge">-cmd</code>) single +command.
Also, the Z-MERT argument for N must match the value for <code class="highlighter-rouge">top_n</code> in +Joshua's config file, indicated by the Z-MERT argument configFile (<code class="highlighter-rouge">-cfg</code>).</p> + +<p>For more details on Z-MERT, refer to <code class="highlighter-rouge">$JOSHUA/examples/ZMERT/README_ZMERT.txt</code>.</p> + + + </div> + </div> + </div> <!-- /container --> + + <!-- Le javascript + ================================================== --> + <!-- Placed at the end of the document so the pages load faster --> + <script src="bootstrap/js/jquery.js"></script> + <script src="bootstrap/js/bootstrap-transition.js"></script> + <script src="bootstrap/js/bootstrap-alert.js"></script> + <script src="bootstrap/js/bootstrap-modal.js"></script> + <script src="bootstrap/js/bootstrap-dropdown.js"></script> + <script src="bootstrap/js/bootstrap-scrollspy.js"></script> + <script src="bootstrap/js/bootstrap-tab.js"></script> + <script src="bootstrap/js/bootstrap-tooltip.js"></script> + <script src="bootstrap/js/bootstrap-popover.js"></script> + <script src="bootstrap/js/bootstrap-button.js"></script> + <script src="bootstrap/js/bootstrap-collapse.js"></script> + <script src="bootstrap/js/bootstrap-carousel.js"></script> + <script src="bootstrap/js/bootstrap-typeahead.js"></script> + + <!-- Start of StatCounter Code for Default Guide --> + <script type="text/javascript"> + var sc_project=8264132; + var sc_invisible=1; + var sc_security="4b97fe2d"; + </script> + <script type="text/javascript" src="http://www.statcounter.com/counter/counter.js"></script> + <noscript> + <div class="statcounter"> + <a title="hit counter joomla" + href="http://statcounter.com/joomla/" + target="_blank"> + <img class="statcounter" + src="http://c.statcounter.com/8264132/0/4b97fe2d/1/" + alt="hit counter joomla" /> + </a> + </div> + </noscript> + <!-- End of StatCounter Code for Default Guide --> + + </body> +</html> 
http://git-wip-us.apache.org/repos/asf/incubator-joshua-site/blob/53cc3005/5.0/zmert.md ---------------------------------------------------------------------- diff --git a/5.0/zmert.md b/5.0/zmert.md deleted file mode 100644 index d6a5d3c..0000000 --- a/5.0/zmert.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -layout: default -category: advanced -title: Z-MERT ---- - -This document describes how to manually run the ZMERT module. ZMERT is Joshua's minimum error-rate -training module, written by Omar F. Zaidan. It is easily adapted to drop in different decoders, and -was also written so as to work with different objective functions (other than BLEU). - -((Section (1) in `$JOSHUA/examples/ZMERT/README_ZMERT.txt` is an expanded version of this section)) - -Z-MERT can be used by launching the driver program (`ZMERT.java`), which expects a config file as -its main argument. This config file can be used to specify any subset of Z-MERT's 20-some -parameters. For a full list of those parameters, and their default values, run ZMERT with a single --h argument as follows: - - java -cp $JOSHUA/bin joshua.zmert.ZMERT -h - -So what does a Z-MERT config file look like? - -Examine the file `examples/ZMERT/ZMERT_config_ex2.txt`.
You will find that it -specifies the following "main" MERT parameters: - - (*) -dir dirPrefix: working directory - (*) -s sourceFile: source sentences (foreign sentences) of the MERT dataset - (*) -r refFile: target sentences (reference translations) of the MERT dataset - (*) -rps refsPerSen: number of reference translations per sentence - (*) -p paramsFile: file containing parameter names, initial values, and ranges - (*) -maxIt maxMERTIts: maximum number of MERT iterations - (*) -ipi initsPerIt: number of intermediate initial points per iteration - (*) -cmd commandFile: name of file containing commands to run the decoder - (*) -decOut decoderOutFile: name of the output file produced by the decoder - (*) -dcfg decConfigFile: name of decoder config file - (*) -N N: size of N-best list (per sentence) generated in each MERT iteration - (*) -v verbosity: output verbosity level (0-2; higher value => more verbose) - (*) -seed seed: seed used to initialize the random number generator - -(Note that the `-s` parameter is only used if Z-MERT is running Joshua as an - internal decoder. If Joshua is run as an external decoder, as is the case in - this README, then this parameter is ignored.) - -To test Z-MERT on the 100-sentence test set of example2, provide this config -file to Z-MERT as follows: - - java -cp bin joshua.zmert.ZMERT -maxMem 500 examples/ZMERT/ZMERT_config_ex2.txt > examples/ZMERT/ZMERT_example/ZMERT.out - -This will run Z-MERT for a couple of iterations on the data from the example2 -folder. (Notice that we have made copies of the source and reference files -from example2 and renamed them as src.txt and ref.* in the MERT_example folder, -just to have all the files needed by Z-MERT in one place.) Once the Z-MERT run -is complete, you should be able to inspect the log file to see what kinds of -things it did. 
If everything goes well, the run should take a few minutes, of -which more than 95% is time spent by Z-MERT waiting on Joshua to finish -decoding the sentences (once per iteration). - -The output file you get should be equivalent to `ZMERT.out.verbosity1`. If you -rerun the experiment with the verbosity (-v) argument set to 2 instead of 1, -the output file you get should be equivalent to `ZMERT.out.verbosity2`, which has -more interesting details about what Z-MERT does. - -Notice the additional `-maxMem` argument. It tells Z-MERT that it should not -continue to use up memory while the decoder is running (during which time Z-MERT -would be idle). The 500 tells Z-MERT that it can use at most 500 MB. -For more details on this issue, see section (4) in Z-MERT's README. - -A quick note about Z-MERT's interaction with the decoder. If you examine the -file `decoder_command_ex2.txt`, which is provided as the commandFile (`-cmd`) -argument in Z-MERT's config file, you'll find it contains the command one would -use to run the decoder. Z-MERT launches the commandFile as an external -process, and assumes that it will launch the decoder to produce translations. -(Make sure that commandFile is executable.) After launching this external -process, Z-MERT waits for it to finish, then uses the resulting output file for -parameter tuning (in addition to the output files from previous iterations). -The command file here only has a single command, but your command file could -have multiple lines. - -Notice that the Z-MERT arguments `configFile` and `decoderOutFile` (`-cfg` and -`-decOut`) must match the two Joshua arguments in the commandFile's (`-cmd`) single -command. Also, the Z-MERT argument for N must match the value for `top_n` in -Joshua's config file, indicated by the Z-MERT argument configFile (`-cfg`). - -For more details on Z-MERT, refer to `$JOSHUA/examples/ZMERT/README_ZMERT.txt`.
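The decoder contract described above is simple: Z-MERT executes the command file, waits for it to exit, then reads the decoder output file. That contract can be sketched with a stand-in "decoder" that merely copies its input; all `/tmp` paths and the fake script are ours for illustration, not the files from example2:

```shell
# Write a stand-in command file. Z-MERT only requires that it be
# executable and that the decoder output file exists after it exits.
cat > /tmp/decoder_command <<'EOF'
#!/bin/sh
# Fake decoder: "translate" by copying the source sentences verbatim.
cp /tmp/src.txt /tmp/decoder.out
EOF
chmod +x /tmp/decoder_command

printf 'source sentence one\nsource sentence two\n' > /tmp/src.txt
/tmp/decoder_command        # what Z-MERT does once per iteration
wc -l < /tmp/decoder.out    # Z-MERT then reads this output file
```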
