buildbot Fri, 08 Apr 2016 11:42:15 -0700

Modified: 
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html 
(original)
+++ 
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,7 +264,18 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="streamingkmeans-algorithm"><em>StreamingKMeans</em> algorithm</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="streamingkmeans-algorithm"><em>StreamingKMeans</em> algorithm<a 
class="headerlink" href="#streamingkmeans-algorithm" title="Permanent 
link">&para;</a></h1>
 <p>The <em>StreamingKMeans</em> algorithm is a variant of Algorithm 1 from <a 
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M. 
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large 
Datasets">Shindler et al</a> and consists of two steps:</p>
 <ol>
 <li>Streaming step </li>
@@ -276,9 +288,9 @@ expected number of clusters is <em>k</em
 clusters that will be passed on to the BallKMeans step which will further 
reduce the 
 number of clusters down to <em>k</em>. BallKMeans is a randomized Lloyd-type 
algorithm that
 has been studied in detail, see <a 
href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf" title="R. 
Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type 
Methods for the k-means Problem">Ostrovsky et al</a>.</p>
-<h2 id="streaming-step">Streaming step</h2>
+<h2 id="streaming-step">Streaming step<a class="headerlink" 
href="#streaming-step" title="Permanent link">&para;</a></h2>
 <hr />
-<h3 id="overview">Overview</h3>
+<h3 id="overview">Overview<a class="headerlink" href="#overview" 
title="Permanent link">&para;</a></h3>
 <p>The streaming step is a derivative of the streaming 
 portion of Algorithm 1 in <a 
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M. 
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large 
Datasets">Shindler et al</a>. The main difference between the two is that 
 Algorithm 1 of <a 
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M. 
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large 
Datasets">Shindler et al</a> assumes 
@@ -290,7 +302,7 @@ In contrast, Mahout implementation does
 data stream. Instead, it dynamically re-evaluates the parameters that depend 
on the size 
 of the data stream at runtime as more and more data is processed. In 
particular, 
 the parameter <em>numClusters</em> (defined below) changes its value as the 
data is processed.   </p>
-<h3 id="parameters">Parameters</h3>
+<h3 id="parameters">Parameters<a class="headerlink" href="#parameters" 
title="Permanent link">&para;</a></h3>
 <ul>
 <li><strong>numClusters</strong> (int): Conceptually, <em>numClusters</em> 
represents the algorithm's guess at the optimal 
 number of clusters it is shooting for. In particular, <em>numClusters</em> 
will increase at run 
@@ -305,7 +317,7 @@ common ratio <em>beta</em> (see below).
 <li><strong>clusterLogFactor</strong> (double): a constant parameter such that 
<em>clusterLogFactor</em> * <em>log(numProcessedPoints)</em> is the runtime 
estimate of the number of clusters to be produced by the streaming step. If the 
final number of clusters (that we expect <em>StreamingKMeans</em> to output) is 
<em>k</em>, <em>clusterLogFactor</em> can be set to <em>k</em>.  </li>
 <li><strong>clusterOvershoot</strong> (double): a constant multiplicative 
slack factor that slows down the collapsing of clusters. The default value is 
2. </li>
 </ul>
-<h3 id="algorithm">Algorithm</h3>
+<h3 id="algorithm">Algorithm<a class="headerlink" href="#algorithm" 
title="Permanent link">&para;</a></h3>
 <p>The algorithm processes the data points one by one and makes only one pass through 
the data.
 The first point from the data stream will form the centroid of the first 
cluster (this designation may change as more points are processed). Suppose 
there are <em>r</em> clusters at one point and a new point <em>p</em> is being 
processed. The new point can either be added to one of the existing <em>r</em> 
clusters or become a new cluster. To decide:</p>
 <ul>
@@ -317,16 +329,16 @@ The first point from the data stream wil
 <p>There will be either <em>r</em> or <em>r+1</em> clusters after processing a 
new point.</p>
 <p>As the number of clusters increases, it will go over the  
<em>clusterOvershoot * numClusters</em> limit (<em>numClusters</em> represents 
a recommendation for the number of clusters that the streaming step should aim 
for and <em>clusterOvershoot</em> is the slack). To decrease the number of 
clusters the existing clusters
 are treated as data points and are re-clustered (collapsed). This tends to 
make the number of clusters go down. If the number of clusters is still too 
high, <em>distanceCutoff</em> is increased.</p>
-<h2 id="ballkmeans-step">BallKMeans step</h2>
+<h2 id="ballkmeans-step">BallKMeans step<a class="headerlink" 
href="#ballkmeans-step" title="Permanent link">&para;</a></h2>
 <hr />
-<h3 id="overview_1">Overview</h3>
+<h3 id="overview_1">Overview<a class="headerlink" href="#overview_1" 
title="Permanent link">&para;</a></h3>
 <p>The algorithm is a Lloyd-type algorithm that takes a set of weighted 
vectors and returns k centroids, see <a 
href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf" title="R. 
Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type 
Methods for the k-means Problem">Ostrovsky et al</a> for details. The algorithm 
has two stages: </p>
 <ol>
 <li>Seeding </li>
 <li>Ball k-means </li>
 </ol>
 <p>The seeding stage is an initial guess of where the centroids should be. The 
initial guess is improved using the ball k-means stage. </p>
-<h3 id="parameters_1">Parameters</h3>
+<h3 id="parameters_1">Parameters<a class="headerlink" href="#parameters_1" 
title="Permanent link">&para;</a></h3>
 <ul>
 <li>
 <p><strong>numClusters</strong> (int): the number k of centroids to return.  
The algorithm will return exactly this number of centroids.</p>
@@ -350,7 +362,7 @@ are treated as data points and are re-cl
 <p><strong>numRuns</strong> (int): This is the number of runs to perform. The 
solution of lowest cost is returned.  The default is 1 run.</p>
 </li>
 </ul>
-<h3 id="algorithm_1">Algorithm</h3>
+<h3 id="algorithm_1">Algorithm<a class="headerlink" href="#algorithm_1" 
title="Permanent link">&para;</a></h3>
 <p>The algorithm can be instructed to take multiple independent runs (using 
the <em>numRuns</em> parameter) and the algorithm will select the best solution 
(i.e., the one with the lowest cost). In practice, one run is sufficient to 
find a good solution.  </p>
 <p>Each run operates as follows: a seeding procedure is used to select k 
centroids, and then ball k-means is run iteratively to refine the solution.</p>
 <p>The seeding procedure can be set to either 'uniformly at random' or 
'k-means++' using the <em>kMeansPlusPlusInit</em> boolean variable. Seeding with 
k-means++ involves more computation but offers better results in practice. </p>
@@ -360,7 +372,7 @@ are treated as data points and are re-cl
 <li>The centers of mass of the trimmed clusters (see <em>trimFraction</em> 
parameter above) become the new centroids </li>
 </ol>
 <p>The data may be partitioned into a test set and a training set (see 
<em>testProbability</em>). The seeding procedure and ball k-means run on the 
training set. The cost is computed on the test set.</p>
-<h2 id="usage-of-streamingkmeans">Usage of <em>StreamingKMeans</em></h2>
+<h2 id="usage-of-streamingkmeans">Usage of <em>StreamingKMeans</em><a 
class="headerlink" href="#usage-of-streamingkmeans" title="Permanent 
link">&para;</a></h2>
 <div class="codehilite"><pre> <span class="n">bin</span><span 
class="o">/</span><span class="n">mahout</span> <span 
class="n">streamingkmeans</span>  
    <span class="o">-</span><span class="nb">i</span> <span 
class="o">&lt;</span><span class="n">input</span><span class="o">&gt;</span>  
    <span class="o">-</span><span class="n">o</span> <span 
class="o">&lt;</span><span class="n">output</span><span class="o">&gt;</span> 
@@ -387,7 +399,7 @@ are treated as data points and are re-cl
 </pre></div>
 
 
-<h3 id="details-on-job-specific-options">Details on Job-Specific Options:</h3>
+<h3 id="details-on-job-specific-options">Details on Job-Specific Options:<a 
class="headerlink" href="#details-on-job-specific-options" title="Permanent 
link">&para;</a></h3>
 <ul>
 <li><code>--input (-i) &lt;input&gt;</code>: Path to job input directory.      
   </li>
 <li><code>--output (-o) &lt;output&gt;</code>: The directory pathname for 
output.            </li>
@@ -412,7 +424,7 @@ are treated as data points and are re-cl
 <li><code>--startPhase &lt;startPhase&gt;</code> First phase to run.  </li>
 <li><code>--endPhase &lt;endPhase&gt;</code> Last phase to run.   </li>
 </ul>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <ol>
 <li><a href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" 
title="M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large 
Datasets">M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For 
Large Datasets</a></li>
 <li><a href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf" 
title="R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of 
Lloyd-Type Methods for the k-means Problem">R. Ostrovsky, Y. Rabani, L. 
Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means 
Problem</a></li>
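
The two quantities that drive the streaming step in the page above can be sketched as follows. This is a minimal illustration and not Mahout's implementation: the function names are hypothetical, the logarithm is assumed to be the natural log, and "goes over the limit" is read as a strict comparison.

```python
import math


def estimated_clusters(cluster_log_factor, num_processed_points):
    """Runtime estimate of the number of clusters produced by the streaming
    step: clusterLogFactor * log(numProcessedPoints).
    Assumption: natural logarithm."""
    return cluster_log_factor * math.log(num_processed_points)


def needs_collapse(current_clusters, num_clusters, cluster_overshoot=2.0):
    """True once the cluster count goes over clusterOvershoot * numClusters,
    the point at which existing clusters are re-clustered (collapsed).
    The default of 2 for clusterOvershoot is the one stated in the page."""
    return current_clusters > cluster_overshoot * num_clusters


if __name__ == "__main__":
    # With clusterLogFactor set to k = 10 and 1000 points processed so far.
    print(round(estimated_clusters(10, 1000), 2))
    print(needs_collapse(150, 69))
```

A larger clusterOvershoot raises the limit, so collapsing happens less often, which matches the page's description of it as a slack factor.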


Modified: 
websites/staging/mahout/trunk/content/users/clustering/viewing-result.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/viewing-result.html 
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/viewing-result.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,14 +264,25 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <ul>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<ul>
 <li><a href="#ViewingResult-AlgorithmViewingpages">Algorithm Viewing 
pages</a></li>
 </ul>
 <p>There are various technologies available to view the output of Mahout
 algorithms.
 * Clusters</p>
 <p><a name="ViewingResult-AlgorithmViewingpages"></a></p>
-<h1 id="algorithm-viewing-pages">Algorithm Viewing pages</h1>
+<h1 id="algorithm-viewing-pages">Algorithm Viewing pages<a class="headerlink" 
href="#algorithm-viewing-pages" title="Permanent link">&para;</a></h1>
 <p>{pagetree:root=@self|excerpt=true|expandCollapseAll=true}</p>
    </div>
   </div>     

Modified: 
websites/staging/mahout/trunk/content/users/clustering/viewing-results.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/viewing-results.html 
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/viewing-results.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,27 +264,38 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="ViewingResults-Intro"></a></p>
-<h1 id="intro">Intro</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="ViewingResults-Intro"></a></p>
+<h1 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent 
link">&para;</a></h1>
 <p>Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
 sequence files or other data structures.  This page is intended to
 demonstrate the various ways one might inspect the outcome of various jobs.
  The page is organized by algorithms.</p>
 <p><a name="ViewingResults-GeneralUtilities"></a></p>
-<h1 id="general-utilities">General Utilities</h1>
+<h1 id="general-utilities">General Utilities<a class="headerlink" 
href="#general-utilities" title="Permanent link">&para;</a></h1>
 <p><a name="ViewingResults-SequenceFileDumper"></a></p>
-<h2 id="sequence-file-dumper">Sequence File Dumper</h2>
+<h2 id="sequence-file-dumper">Sequence File Dumper<a class="headerlink" 
href="#sequence-file-dumper" title="Permanent link">&para;</a></h2>
 <p><a name="ViewingResults-Clustering"></a></p>
-<h1 id="clustering">Clustering</h1>
+<h1 id="clustering">Clustering<a class="headerlink" href="#clustering" 
title="Permanent link">&para;</a></h1>
 <p><a name="ViewingResults-ClusterDumper"></a></p>
-<h2 id="cluster-dumper">Cluster Dumper</h2>
+<h2 id="cluster-dumper">Cluster Dumper<a class="headerlink" 
href="#cluster-dumper" title="Permanent link">&para;</a></h2>
 <p>Run the following to print out all options:</p>
 <div class="codehilite"><pre><span class="n">java</span>  <span 
class="o">-</span><span class="n">cp</span> &quot;<span 
class="o">*</span>&quot; <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">utils</span><span class="p">.</span><span 
class="n">clustering</span><span class="p">.</span><span 
class="n">ClusterDumper</span> <span class="o">--</span><span 
class="n">help</span>
 </pre></div>
 
 
 <p><a name="ViewingResults-Example"></a></p>
-<h3 id="example">Example</h3>
+<h3 id="example">Example<a class="headerlink" href="#example" title="Permanent 
link">&para;</a></h3>
 <div class="codehilite"><pre><span class="n">java</span>  <span 
class="o">-</span><span class="n">cp</span> &quot;<span 
class="o">*</span>&quot; <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">utils</span><span class="p">.</span><span 
class="n">clustering</span><span class="p">.</span><span 
class="n">ClusterDumper</span> <span class="o">--</span><span 
class="n">seqFileDir</span>
 </pre></div>
 
@@ -292,9 +304,9 @@ demonstrate the various ways one might i
           --dictionary ./solr-clust-n2/dictionary.txt
           --substring 100 --pointsDir ./solr-clust-n2/out/points/</p>
 <p><a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a></p>
-<h2 id="cluster-labels-mahout-163">Cluster Labels (MAHOUT-163)</h2>
+<h2 id="cluster-labels-mahout-163">Cluster Labels (MAHOUT-163)<a 
class="headerlink" href="#cluster-labels-mahout-163" title="Permanent 
link">&para;</a></h2>
 <p><a name="ViewingResults-Classification"></a></p>
-<h1 id="classification">Classification</h1>
+<h1 id="classification">Classification<a class="headerlink" 
href="#classification" title="Permanent link">&para;</a></h1>
    </div>
   </div>     
 </div> 

Modified: 
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,8 +264,19 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="VisualizingSampleClusters-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="VisualizingSampleClusters-Introduction"></a></p>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction" 
title="Permanent link">&para;</a></h1>
 <p>Mahout provides examples to visualize sample clusters that get created by
 our clustering algorithms. Note that the visualization is done by Swing 
programs. You have to be in a window system on the same
 machine you run these, or logged in via a remote desktop.</p>
@@ -272,7 +284,7 @@ machine you run these, or logged in via
 classes under <em>org.apache.mahout.clustering.display</em> package in
 mahout-examples module. The easiest way to achieve this is to <a 
href="users/basics/quickstart.html">set up Mahout</a> in your IDE.</p>
 <p><a name="VisualizingSampleClusters-Visualizingclusters"></a></p>
-<h1 id="visualizing-clusters">Visualizing clusters</h1>
+<h1 id="visualizing-clusters">Visualizing clusters<a class="headerlink" 
href="#visualizing-clusters" title="Permanent link">&para;</a></h1>
 <p>The following classes in <em>org.apache.mahout.clustering.display</em> can 
be run
 without parameters to generate a sample data set and run the reference
 clustering implementations over them:</p>

Modified: 
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,7 +264,18 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="support-for-dimensional-reduction">Support for dimensional 
reduction</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="support-for-dimensional-reduction">Support for dimensional reduction<a 
class="headerlink" href="#support-for-dimensional-reduction" title="Permanent 
link">&para;</a></h1>
 <p>Matrix algebra underpins the way many Big Data algorithms and data
 structures are composed: full-text search can be viewed as doing matrix
 multiplication of the term-document matrix by the query vector (giving a
@@ -307,16 +319,16 @@ course, sparse matrices which don't fit
 far as decomposition is concerned. Parallelizable and/or stream-oriented
 algorithms are needed.</p>
 <p><a name="DimensionalReduction-SingularValueDecomposition"></a></p>
-<h1 id="singular-value-decomposition">Singular Value Decomposition</h1>
+<h1 id="singular-value-decomposition">Singular Value Decomposition<a 
class="headerlink" href="#singular-value-decomposition" title="Permanent 
link">&para;</a></h1>
 <p>Mahout currently provides (as of 0.3, the first release with 
MAHOUT-180 applied) two scalable implementations of SVD: a stream-oriented 
implementation using the Asymmetric Generalized Hebbian Algorithm outlined in 
Genevieve Gorrell &amp; Brandyn Webb's paper (<a 
href="-http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf.html">Gorrell and 
Webb 2005</a>), and a <a 
href="http://en.wikipedia.org/wiki/Lanczos_algorithm">Lanczos</a>
 implementation, available both single-threaded in the
 o.a.m.math.decomposer.lanczos package (math module) and as a Hadoop map-reduce
 (series of) job(s) in the o.a.m.math.hadoop.decomposer package (core module).
 Coming soon: stochastic decomposition.</p>
-<p>See also: <a 
href="Wikipedia%20-%20SVD">https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition</a></p>
+<p>See also: <a href="Wikipedia - 
SVD">https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition</a></p>
 <p><a name="DimensionalReduction-Lanczos"></a></p>
-<h2 id="lanczos">Lanczos</h2>
+<h2 id="lanczos">Lanczos<a class="headerlink" href="#lanczos" title="Permanent 
link">&para;</a></h2>
 <p>The Lanczos algorithm is designed for eigen-decomposition, but like any
 such algorithm, getting singular vectors out of it is immediate (singular
 vectors of matrix A are just the eigenvectors of A^t * A or A * A^t). 
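
The relationship stated above (singular vectors of A are eigenvectors of A^t * A) can be checked with a small sketch. Note that this uses plain power iteration, not Lanczos, and only recovers the largest singular value; all function names are hypothetical, and it assumes A is nonzero and that the all-ones starting vector is not orthogonal to the dominant eigenvector.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def top_singular_pair(a, iterations=200):
    """Return (largest singular value, right singular vector) of `a` by running
    power iteration on A^t * A. Illustration only; not the Lanczos method."""
    ata = matmul(transpose(a), a)
    v = [1.0] * len(ata)
    for _ in range(iterations):
        w = matvec(ata, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the dominant eigenvalue of A^t * A;
    # its square root is the corresponding singular value of A.
    eigenvalue = sum(x * y for x, y in zip(v, matvec(ata, v)))
    return math.sqrt(eigenvalue), v


if __name__ == "__main__":
    sigma, vector = top_singular_pair([[3.0, 0.0], [0.0, 2.0]])
    print(sigma, vector)
```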
@@ -344,7 +356,7 @@ via Lanczos, and then discard the bottom
 the largest singular values (which is the case for using Lanczos for
 dimensional reduction).</p>
 <p><a name="DimensionalReduction-ParallelizationStragegy"></a></p>
-<h3 id="parallelization-stragegy">Parallelization Strategy</h3>
+<h3 id="parallelization-stragegy">Parallelization Strategy<a 
class="headerlink" href="#parallelization-stragegy" title="Permanent 
link">&para;</a></h3>
 <p>Lanczos is "embarrassingly parallelizable": matrix multiplication of a
 matrix by a vector may be carried out row-at-a-time without communication
 until at the end, the results of the intermediate matrix-by-vector outputs
@@ -359,7 +371,7 @@ delaying writing to disk until Mapper cl
 a Combiner be the same as the Reducer, the bottleneck in accumulation is
 nowhere near a single point.</p>
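
The row-at-a-time idea described above can be sketched in a few lines, assuming the product being computed is (A^t * A) v, which is the one needed when singular vectors are extracted. Each row contributes independently (the mapper-like part), and the contributions are simply summed (the combiner/reducer-like part). The function names are hypothetical and this is not Mahout's code.

```python
def row_contribution(row, v):
    """Contribution of one row a_i to (A^t * A) v, namely (a_i . v) * a_i.
    Needs only this row and v, so rows can be processed independently."""
    dot = sum(x * y for x, y in zip(row, v))
    return [dot * x for x in row]


def ata_times_vector(rows, v):
    """Sum the per-row contributions. Because addition is associative,
    partial sums could be formed anywhere (as a combiner would) before
    the final accumulation."""
    total = [0.0] * len(v)
    for row in rows:
        contribution = row_contribution(row, v)
        total = [t + c for t, c in zip(total, contribution)]
    return total


if __name__ == "__main__":
    print(ata_times_vector([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
```

Since each partial result is a vector of the same length as v, the amount of data passed to the final accumulation does not depend on how many rows each worker handled.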
 <p><a name="DimensionalReduction-Mahoutusage"></a></p>
-<h3 id="mahout-usage">Mahout usage</h3>
+<h3 id="mahout-usage">Mahout usage<a class="headerlink" href="#mahout-usage" 
title="Permanent link">&para;</a></h3>
 <p>The Mahout DistributedLanczosSolver is invoked by the
 <MAHOUT_HOME>/bin/mahout svd command. This command takes the following
 arguments (which can be reproduced by just entering the command with no
@@ -456,7 +468,7 @@ the long form svd invocation:</p>
 <p>TODO: also allow exclusion based on improper orthogonality (currently
 computed, but not checked against constraints).</p>
 <p><a 
name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a></p>
-<h4 id="example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce">Example: 
SVD of ASF Mail Archives on Amazon Elastic MapReduce</h4>
+<h4 id="example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce">Example: 
SVD of ASF Mail Archives on Amazon Elastic MapReduce<a class="headerlink" 
href="#example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce" 
title="Permanent link">&para;</a></h4>
 <p>This section walks you through a complete example of running the Mahout SVD
 job on Amazon Elastic MapReduce cluster and then preparing the output to be
 used for clustering. This example was developed as part of the effort to
@@ -479,7 +491,7 @@ mailing list, see: <a href="http://searc
 <p>Note: Some of this work is due in part to credits donated by the Amazon
 Elastic MapReduce team.</p>
 <p><a name="DimensionalReduction-1.LaunchEMRCluster"></a></p>
-<h5 id="1-launch-emr-cluster">1. Launch EMR Cluster</h5>
+<h5 id="1-launch-emr-cluster">1. Launch EMR Cluster<a class="headerlink" 
href="#1-launch-emr-cluster" title="Permanent link">&para;</a></h5>
 <p>For a detailed explanation of the steps involved in launching an Amazon
 Elastic MapReduce cluster for running Mahout jobs, please read the
 "Building Vectors for Large Document Sets" section of <a 
href="https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce">Mahout
 on Elastic MapReduce</a>
@@ -487,11 +499,11 @@ Elastic MapReduce cluster for running Ma
 <p>In the remaining steps below, remember to replace JOB_ID with the Job ID of
 your EMR cluster.</p>
 <p><a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a></p>
-<h5 id="2-load-mahout-05-jar-into-s3">2. Load Mahout 0.5+ JAR into S3</h5>
+<h5 id="2-load-mahout-05-jar-into-s3">2. Load Mahout 0.5+ JAR into S3<a 
class="headerlink" href="#2-load-mahout-05-jar-into-s3" title="Permanent 
link">&para;</a></h5>
 <p>These steps were created with the mahout-0.5-SNAPSHOT because they rely on
 the patch for <a 
href="https://issues.apache.org/jira/browse/MAHOUT-639">MAHOUT-639</a></p>
 <p><a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a></p>
-<h5 id="3-copy-tfidf-vectors-into-hdfs">3. Copy TFIDF Vectors into HDFS</h5>
+<h5 id="3-copy-tfidf-vectors-into-hdfs">3. Copy TFIDF Vectors into HDFS<a 
class="headerlink" href="#3-copy-tfidf-vectors-into-hdfs" title="Permanent 
link">&para;</a></h5>
 <p>Before running your SVD job on the vectors, you need to copy them from S3
 to your EMR cluster's HDFS.</p>
 <div class="codehilite"><pre><span class="n">elastic</span><span 
class="o">-</span><span class="n">mapreduce</span> <span 
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span 
class="p">:</span><span class="o">//</span><span 
class="n">elasticmapreduce</span><span class="o">/</span><span 
class="n">samples</span><span class="o">/</span><span 
class="n">distcp</span><span class="o">/</span><span 
class="n">distcp</span><span class="p">.</span><span class="n">jar</span> <span 
class="o">\</span>
@@ -502,7 +514,7 @@ to your EMR cluster's HDFS.</p>
 
 
 <p><a name="DimensionalReduction-4.RuntheSVDJob"></a></p>
-<h5 id="4-run-the-svd-job">4. Run the SVD Job</h5>
+<h5 id="4-run-the-svd-job">4. Run the SVD Job<a class="headerlink" 
href="#4-run-the-svd-job" title="Permanent link">&para;</a></h5>
 <p>Now you're ready to run the SVD job on the vectors stored in HDFS:</p>
 <div class="codehilite"><pre><span class="n">elastic</span><span 
class="o">-</span><span class="n">mapreduce</span> <span 
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span 
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span 
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span 
class="n">examples</span><span class="o">-</span>0<span 
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span 
class="o">-</span><span class="n">job</span><span class="p">.</span><span 
class="n">jar</span> <span class="o">\</span>
   <span class="o">--</span><span class="n">main</span><span 
class="o">-</span><span class="n">class</span> <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">driver</span><span class="p">.</span><span 
class="n">MahoutDriver</span> <span class="o">\</span>
@@ -528,7 +540,7 @@ removes any duplicate eigenvectors cause
 overflow and any that don't appear to be "eigen" enough (i.e., they don't
 satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix</p>
 <p><a 
name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a></p>
-<h5 id="5-transform-your-tfidf-vectors-into-mahout-matrix">5. Transform your 
TFIDF Vectors into Mahout Matrix</h5>
+<h5 id="5-transform-your-tfidf-vectors-into-mahout-matrix">5. Transform your 
TFIDF Vectors into Mahout Matrix<a class="headerlink" 
href="#5-transform-your-tfidf-vectors-into-mahout-matrix" title="Permanent 
link">&para;</a></h5>
 <p>The tfidf vectors created by the seq2sparse job are
 SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
 vectors into a matrix form that is a
@@ -558,7 +570,7 @@ your EMR cluster. The job produces the f
 <p>where docIndex is the SequenceFile<IntWritable,Text> and matrix is
 SequenceFile<IntWritable,VectorWritable>.</p>
 <p><a name="DimensionalReduction-6.TransposetheMatrix"></a></p>
-<h5 id="6-transpose-the-matrix">6. Transpose the Matrix</h5>
+<h5 id="6-transpose-the-matrix">6. Transpose the Matrix<a class="headerlink" 
href="#6-transpose-the-matrix" title="Permanent link">&para;</a></h5>
 <p>Our ultimate goal is to multiply the TFIDF vector matrix times our SVD
 eigenvectors. For the mathematically inclined, from the rowid job, we now
 have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
@@ -598,7 +610,7 @@ numColsZ == numColsX). - Jake Mannix</p>
 
 
 <p><a name="DimensionalReduction-7.TransposeEigenvectors"></a></p>
-<h5 id="7-transpose-eigenvectors">7. Transpose Eigenvectors</h5>
+<h5 id="7-transpose-eigenvectors">7. Transpose Eigenvectors<a 
class="headerlink" href="#7-transpose-eigenvectors" title="Permanent 
link">&para;</a></h5>
 <p>If you followed Jake's explanation in step 6 above, then you know that we
 also need to transpose the eigenvectors:</p>
 <div class="codehilite"><pre><span class="n">elastic</span><span 
class="o">-</span><span class="n">mapreduce</span> <span 
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span 
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span 
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span 
class="n">examples</span><span class="o">-</span>0<span 
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span 
class="o">-</span><span class="n">job</span><span class="p">.</span><span 
class="n">jar</span> <span class="o">\</span>
@@ -620,7 +632,7 @@ transposing the matrix you are multiplyi
 
 
 <p><a name="DimensionalReduction-8.MatrixMultiplication"></a></p>
-<h5 id="8-matrix-multiplication">8. Matrix Multiplication</h5>
+<h5 id="8-matrix-multiplication">8. Matrix Multiplication<a class="headerlink" 
href="#8-matrix-multiplication" title="Permanent link">&para;</a></h5>
 <p>Lastly, we need to multiply the transposed vectors using Mahout's
 matrixmult job:</p>
 <div class="codehilite"><pre><span class="n">elastic</span><span 
class="o">-</span><span class="n">mapreduce</span> <span 
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span 
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span 
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span 
class="n">examples</span><span class="o">-</span>0<span 
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span 
class="o">-</span><span class="n">job</span><span class="p">.</span><span 
class="n">jar</span> <span class="o">\</span>
@@ -643,7 +655,7 @@ matrixmult job:</p>
 
 
 <p><a name="DimensionalReduction-Resources"></a></p>
-<h1 id="resources">Resources</h1>
+<h1 id="resources">Resources<a class="headerlink" href="#resources" 
title="Permanent link">&para;</a></h1>
 <ul>
 <li><a href="http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm">LSA 
tutorial</a></li>
 <li><a 
href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">SVD
 tutorial</a></li>

Modified: websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html 
(original)
+++ websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html Fri Apr 
 8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,10 +264,21 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="stochastic-singular-value-decomposition">Stochastic Singular Value 
Decomposition</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="stochastic-singular-value-decomposition">Stochastic Singular Value 
Decomposition<a class="headerlink" 
href="#stochastic-singular-value-decomposition" title="Permanent 
link">&para;</a></h1>
 <p>The Stochastic SVD method in Mahout produces reduced-rank Singular Value 
Decomposition output in its 
 strict mathematical definition: <code>\(\mathbf{A\approx 
U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)</code>.</p>
-<h2 id="the-benefits-over-other-methods-are">The benefits over other methods 
are:</h2>
+<h2 id="the-benefits-over-other-methods-are">The benefits over other methods 
are:<a class="headerlink" href="#the-benefits-over-other-methods-are" 
title="Permanent link">&para;</a></h2>
 <ul>
 <li>
 <p>reduced flops required compared to Krylov subspace methods</p>
@@ -284,14 +296,14 @@ strict mathematical definition: <code>\(
 <p>As of 0.7 trunk, includes PCA and dimensionality reduction workflow 
(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is 
greatly appreciated.)</p>
 </li>
 </ul>
-<h3 id="map-reduce-characteristics">Map-Reduce characteristics:</h3>
+<h3 id="map-reduce-characteristics">Map-Reduce characteristics:<a 
class="headerlink" href="#map-reduce-characteristics" title="Permanent 
link">&para;</a></h3>
 <p>SSVD uses at most 3 sequential MR steps (map-only + map-reduce + 2 optional 
parallel map-reduce jobs) to produce a reduced-rank approximation of the U, V and S 
matrices. Additionally, two more map-reduce steps are added for each power 
iteration step if requested.</p>
-<h2 id="potential-drawbacks">Potential drawbacks:</h2>
+<h2 id="potential-drawbacks">Potential drawbacks:<a class="headerlink" 
href="#potential-drawbacks" title="Permanent link">&para;</a></h2>
 <p>Potentially less precise (but adding even one power iteration seems to improve 
precision quite a bit).</p>
-<h2 id="documentation">Documentation</h2>
+<h2 id="documentation">Documentation<a class="headerlink" 
href="#documentation" title="Permanent link">&para;</a></h2>
 <p><a href="ssvd.page/SSVD-CLI.pdf">Overview and Usage</a></p>
 <p>Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or 
later.</p>
-<h2 id="publications">Publications</h2>
+<h2 id="publications">Publications<a class="headerlink" href="#publications" 
title="Permanent link">&para;</a></h2>
 <p><a 
href="http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf">Nathan
 Halko's dissertation</a> "Randomized methods for computing low-rank
 approximations of matrices" contains a comprehensive definition of the 
parallelization strategy taken in the Mahout SSVD implementation, along with some 
precision/scalability benchmarks, especially against the Mahout Lanczos implementation on 
a typical corpus data set.</p>
 <p>The <a href="http://arxiv.org/abs/0909.4061">Halko, Martinsson, Tropp</a> paper 
discusses a family of random projection-based algorithms and contains theoretical 
error estimates.</p>
@@ -318,7 +330,7 @@ x<span class="o">&lt;-</span> usim <span
 
 <p>and try to compare ssvd.svd(x) and stock svd(x) performance for the same 
rank k, notice the difference in the running time. Also play with power 
iterations (qIter) and compare accuracies of standard svd and SSVD.</p>
 <p>Note: numerical stability of R algorithms may differ from that of Mahout's 
distributed version. We haven't studied accuracy of the R simulation. For study 
of accuracy of Mahout's version, please refer to Nathan's dissertation as 
referenced above.</p>
-<h4 id="modified-ssvd-algorithm">Modified SSVD Algorithm.</h4>
+<h4 id="modified-ssvd-algorithm">Modified SSVD Algorithm.<a class="headerlink" 
href="#modified-ssvd-algorithm" title="Permanent link">&para;</a></h4>
 <p>Given an <code>\(m\times n\)</code>
 matrix <code>\(\mathbf{A}\)</code>, a target rank 
<code>\(k\in\mathbb{N}_{1}\)</code>
 , an oversampling parameter <code>\(p\in\mathbb{N}_{1}\)</code>, 

Modified: 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,11 +264,22 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="building-a-text-classifier-in-mahouts-spark-shell">Building a text 
classifier in Mahout's Spark Shell</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="building-a-text-classifier-in-mahouts-spark-shell">Building a text 
classifier in Mahout's Spark Shell<a class="headerlink" 
href="#building-a-text-classifier-in-mahouts-spark-shell" title="Permanent 
link">&para;</a></h1>
 <p>This tutorial will take you through the steps used to train a Multinomial 
Naive Bayes model and create a text classifier based on that model using the 
<code>mahout spark-shell</code>. </p>
-<h2 id="prerequisites">Prerequisites</h2>
+<h2 id="prerequisites">Prerequisites<a class="headerlink" 
href="#prerequisites" title="Permanent link">&para;</a></h2>
 <p>This tutorial assumes that you have your Spark environment variables set 
for the <code>mahout spark-shell</code>; see <a 
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
 with Mahout's Shell</a>.  We also assume that Mahout is running in cluster 
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable 
<strong>unset</strong>), as we'll be reading from and writing to HDFS.</p>
-<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and 
Vectorizing the Wikipedia dataset</h2>
+<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and 
Vectorizing the Wikipedia dataset<a class="headerlink" 
href="#downloading-and-vectorizing-the-wikipedia-dataset" title="Permanent 
link">&para;</a></h2>
 <p><em>As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions 
of <code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract 
and vectorize our text.  A</em> <a 
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark 
implementation of seq2sparse</em></a> <em>is in the works for Mahout v. 
0.11.</em> However, to download the Wikipedia dataset, extract the bodies of 
the documents, label each document and vectorize the text into TF-IDF 
vectors, we can simply run the <a 
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a>
 example.  </p>
 <div class="codehilite"><pre><span class="n">Please</span> <span 
class="n">select</span> <span class="n">a</span> <span class="n">number</span> 
<span class="n">to</span> <span class="n">choose</span> <span 
class="n">the</span> <span class="n">corresponding</span> <span 
class="n">task</span> <span class="n">to</span> <span class="n">run</span>
 1<span class="p">.</span> <span class="n">CBayes</span> <span 
class="p">(</span><span class="n">may</span> <span class="n">require</span> 
<span class="n">increased</span> <span class="n">heap</span> <span 
class="n">space</span> <span class="n">on</span> <span 
class="n">yarn</span><span class="p">)</span>
@@ -278,14 +290,14 @@
 
 
 <p>Enter (2). This will download a large recent XML dump of the Wikipedia 
database into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and  
place it into HDFS.  It will run a <a 
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
 job to parse the Wikipedia set</a>, extracting and labeling only pages with 
category tags for [United States] and [United Kingdom] (~11600 documents). It 
will then run <code>mahout seq2sparse</code> to convert the documents into 
TF-IDF vectors.  The script will also build and test a <a 
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes 
model using MapReduce</a>.  When it is completed, you should see a confusion 
matrix on your screen.  For this tutorial, we will ignore the MapReduce model, 
and build a new model using Spark based on the vectorized text output by 
<code>seq2sparse</code>.</p>
-<h2 id="getting-started">Getting Started</h2>
+<h2 id="getting-started">Getting Started<a class="headerlink" 
href="#getting-started" title="Permanent link">&para;</a></h2>
 <p>Launch the <code>mahout spark-shell</code>.  There is an example script: 
<code>spark-document-classifier.mscala</code> (.mscala denotes a Mahout-Scala 
script which can be run similarly to an R script).   We will be walking through 
this script for this tutorial but if you wanted to simply run the script, you 
could just issue the command: </p>
 <div class="codehilite"><pre><span class="n">mahout</span><span 
class="o">&gt;</span> <span class="p">:</span><span class="n">load</span> <span 
class="o">/</span><span class="n">path</span><span class="o">/</span><span 
class="n">to</span><span class="o">/</span><span class="n">mahout</span><span 
class="o">/</span><span class="n">examples</span><span class="o">/</span><span 
class="n">bin</span><span class="o">/</span><span class="n">spark</span><span 
class="o">-</span><span class="n">document</span><span class="o">-</span><span 
class="n">classifier</span><span class="p">.</span><span class="n">mscala</span>
 </pre></div>
 
 
 <p>For now, let's take the script apart piece by piece.  You can cut and paste 
the following code blocks into the <code>mahout spark-shell</code>.</p>
-<h2 id="imports">Imports</h2>
+<h2 id="imports">Imports<a class="headerlink" href="#imports" title="Permanent 
link">&para;</a></h2>
 <p>Our Mahout Naive Bayes imports:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">classifier</span><span class="p">.</span><span 
class="n">naivebayes</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">classifier</span><span class="p">.</span><span 
class="n">stats</span><span class="p">.</span><span class="n">_</span>
@@ -300,19 +312,19 @@
 </pre></div>
 
 
-<h2 
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
 in our full set from HDFS as vectorized by seq2sparse in 
classify-wikipedia.sh</h2>
+<h2 
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
 in our full set from HDFS as vectorized by seq2sparse in 
classify-wikipedia.sh<a class="headerlink" 
href="#read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash"
 title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">pathToData</span> <span class="p">=</span> &quot;<span 
class="o">/</span><span class="n">tmp</span><span class="o">/</span><span 
class="n">mahout</span><span class="o">-</span><span class="n">work</span><span 
class="o">-</span><span class="n">wiki</span><span class="o">/</span>&quot;
 <span class="n">val</span> <span class="n">fullData</span> <span 
class="p">=</span> <span class="n">drmDfsRead</span><span 
class="p">(</span><span class="n">pathToData</span> <span class="o">+</span> 
&quot;<span class="n">wikipediaVecs</span><span class="o">/</span><span 
class="n">tfidf</span><span class="o">-</span><span 
class="n">vectors</span>&quot;<span class="p">)</span>
 </pre></div>
 
 
-<h2 
id="extract-the-category-of-each-observation-and-aggregate-those-observations-by-category">Extract
 the category of each observation and aggregate those observations by 
category</h2>
+<h2 
id="extract-the-category-of-each-observation-and-aggregate-those-observations-by-category">Extract
 the category of each observation and aggregate those observations by 
category<a class="headerlink" 
href="#extract-the-category-of-each-observation-and-aggregate-those-observations-by-category"
 title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span> 
<span class="n">aggregatedObservations</span><span class="p">)</span> <span 
class="p">=</span> <span class="n">SparkNaiveBayes</span><span 
class="p">.</span><span 
class="n">extractLabelsAndAggregateObservations</span><span class="p">(</span>
                                                              <span 
class="n">fullData</span><span class="p">)</span>
 </pre></div>
 
 
-<h2 
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
 a Multinomial Naive Bayes model and self-test on the training set</h2>
+<h2 
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
 a Multinomial Naive Bayes model and self-test on the training set<a 
class="headerlink" 
href="#build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">model</span> <span class="p">=</span> <span 
class="n">SparkNaiveBayes</span><span class="p">.</span><span 
class="n">train</span><span class="p">(</span><span 
class="n">aggregatedObservations</span><span class="p">,</span> <span 
class="n">labelIndex</span><span class="p">,</span> <span 
class="n">false</span><span class="p">)</span>
 <span class="n">val</span> <span class="n">resAnalyzer</span> <span 
class="p">=</span> <span class="n">SparkNaiveBayes</span><span 
class="p">.</span><span class="n">test</span><span class="p">(</span><span 
class="n">model</span><span class="p">,</span> <span 
class="n">fullData</span><span class="p">,</span> <span 
class="n">false</span><span class="p">)</span>
 <span class="n">println</span><span class="p">(</span><span 
class="n">resAnalyzer</span><span class="p">)</span>
@@ -320,7 +332,7 @@
 
 
 <p>Printing the <code>ResultAnalyzer</code> will display the confusion 
matrix.</p>
-<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in 
the dictionary and document frequency count from HDFS</h2>
+<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in 
the dictionary and document frequency count from HDFS<a class="headerlink" 
href="#read-in-the-dictionary-and-document-frequency-count-from-hdfs" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">dictionary</span> <span class="p">=</span> <span 
class="n">sdc</span><span class="p">.</span><span 
class="n">sequenceFile</span><span class="p">(</span><span 
class="n">pathToData</span> <span class="o">+</span> &quot;<span 
class="n">wikipediaVecs</span><span class="o">/</span><span 
class="n">dictionary</span><span class="p">.</span><span 
class="n">file</span><span class="o">-</span>0&quot;<span class="p">,</span>
                                   <span class="n">classOf</span><span 
class="p">[</span><span class="n">Text</span><span class="p">],</span>
                                   <span class="n">classOf</span><span 
class="p">[</span><span class="n">IntWritable</span><span class="p">])</span>
@@ -344,7 +356,7 @@
 </pre></div>
 
 
-<h2 
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
 a function to tokenize and vectorize new text using our current dictionary</h2>
+<h2 
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
 a function to tokenize and vectorize new text using our current dictionary<a 
class="headerlink" 
href="#define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary"
 title="Permanent link">&para;</a></h2>
 <p>For this simple example, our function <code>vectorizeDocument(...)</code> 
will tokenize a new document into unigrams using native Java String methods and 
vectorize using our dictionary and document frequencies. You could also use a 
<a href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams, 
trigrams, etc., and integrate Apache <a 
href="https://tika.apache.org/">Tika</a> to extract text from different 
document types (PDF, PPT, XLS, etc.).  Here, however, we will keep it simple, 
stripping and tokenizing our text using regexes and native String methods.</p>
 <div class="codehilite"><pre>def vectorizeDocument<span 
class="p">(</span>document: String<span class="p">,</span>
                         dictionaryMap: Map<span class="p">[</span>String<span 
class="p">,</span>Int<span class="p">],</span>
@@ -376,7 +388,7 @@
 </pre></div>
 
 
-<h2 id="setup-our-classifier">Set up our classifier</h2>
+<h2 id="setup-our-classifier">Set up our classifier<a class="headerlink" 
href="#setup-our-classifier" title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">labelMap</span> <span class="p">=</span> <span 
class="n">model</span><span class="p">.</span><span class="n">labelIndex</span>
 <span class="n">val</span> <span class="n">numLabels</span> <span 
class="p">=</span> <span class="n">model</span><span class="p">.</span><span 
class="n">numLabels</span>
 <span class="n">val</span> <span class="n">reverseLabelMap</span> <span 
class="p">=</span> <span class="n">labelMap</span><span class="p">.</span><span 
class="n">map</span><span class="p">(</span><span class="n">x</span> <span 
class="p">=</span><span class="o">&gt;</span> <span class="n">x</span><span 
class="p">.</span><span class="n">_2</span> <span class="o">-&gt;</span> <span 
class="n">x</span><span class="p">.</span><span class="n">_1</span><span 
class="p">)</span>
@@ -389,7 +401,7 @@
 </pre></div>
 
 
-<h2 id="define-an-argmax-function">Define an argmax function</h2>
+<h2 id="define-an-argmax-function">Define an argmax function<a 
class="headerlink" href="#define-an-argmax-function" title="Permanent 
link">&para;</a></h2>
 <p>The label with the highest score wins the classification for a given 
document.</p>
 <div class="codehilite"><pre>def argmax<span class="p">(</span>v: Vector<span 
class="p">)</span>: <span class="p">(</span>Int<span class="p">,</span> 
Double<span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
     var bestIdx: Int <span class="o">=</span> Integer.MIN_VALUE
@@ -405,7 +417,7 @@
 </pre></div>
 
 
-<h2 id="define-our-tf-idf-vector-classifier">Define our TF(-IDF) vector 
classifier</h2>
+<h2 id="define-our-tf-idf-vector-classifier">Define our TF(-IDF) vector 
classifier<a class="headerlink" href="#define-our-tf-idf-vector-classifier" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">def</span> <span 
class="n">classifyDocument</span><span class="p">(</span><span 
class="n">clvec</span><span class="p">:</span> <span 
class="n">Vector</span><span class="p">)</span> <span class="p">:</span> <span 
class="n">String</span> <span class="p">=</span> <span class="p">{</span>
     <span class="n">val</span> <span class="n">cvec</span> <span 
class="p">=</span> <span class="n">classifier</span><span 
class="p">.</span><span class="n">classifyFull</span><span 
class="p">(</span><span class="n">clvec</span><span class="p">)</span>
     <span class="n">val</span> <span class="p">(</span><span 
class="n">bestIdx</span><span class="p">,</span> <span 
class="n">bestScore</span><span class="p">)</span> <span class="p">=</span> 
<span class="n">argmax</span><span class="p">(</span><span 
class="n">cvec</span><span class="p">)</span>
@@ -414,7 +426,7 @@
 </pre></div>
 
 
-<h2 
id="two-sample-news-articles-united-states-football-and-united-kingdom-football">Two
 sample news articles: United States Football and United Kingdom Football</h2>
+<h2 
id="two-sample-news-articles-united-states-football-and-united-kingdom-football">Two
 sample news articles: United States Football and United Kingdom Football<a 
class="headerlink" 
href="#two-sample-news-articles-united-states-football-and-united-kingdom-football"
 title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="c1">// A random United States 
football article</span>
 <span class="c1">// 
http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128</span>
 <span class="n">val</span> <span class="n">UStextToClassify</span> <span 
class="o">=</span> <span class="k">new</span> <span 
class="n">String</span><span class="p">(</span><span class="s">&quot;(Reuters) 
- Super Bowl security officials acknowledge&quot;</span> <span 
class="o">+</span>
@@ -483,7 +495,7 @@
 </pre></div>
 
 
-<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our 
documents</h2>
+<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our 
documents<a class="headerlink" href="#vectorize-and-classify-our-documents" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">usVec</span> <span class="p">=</span> <span 
class="n">vectorizeDocument</span><span class="p">(</span><span 
class="n">UStextToClassify</span><span class="p">,</span> <span 
class="n">dictionaryMap</span><span class="p">,</span> <span 
class="n">dfCountMap</span><span class="p">)</span>
 <span class="n">val</span> <span class="n">ukVec</span> <span 
class="p">=</span> <span class="n">vectorizeDocument</span><span 
class="p">(</span><span class="n">UKtextToClassify</span><span 
class="p">,</span> <span class="n">dictionaryMap</span><span class="p">,</span> 
<span class="n">dfCountMap</span><span class="p">)</span>
 
@@ -495,7 +507,7 @@
 </pre></div>
 
 
-<h2 id="tie-everything-together-in-a-new-method-to-classify-text">Tie 
everything together in a new method to classify text</h2>
+<h2 id="tie-everything-together-in-a-new-method-to-classify-text">Tie 
everything together in a new method to classify text<a class="headerlink" 
href="#tie-everything-together-in-a-new-method-to-classify-text" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">def</span> <span 
class="n">classifyText</span><span class="p">(</span><span 
class="n">txt</span><span class="p">:</span> <span class="n">String</span><span 
class="p">):</span> <span class="n">String</span> <span class="p">=</span> 
<span class="p">{</span>
     <span class="n">val</span> <span class="n">v</span> <span 
class="p">=</span> <span class="n">vectorizeDocument</span><span 
class="p">(</span><span class="n">txt</span><span class="p">,</span> <span 
class="n">dictionaryMap</span><span class="p">,</span> <span 
class="n">dfCountMap</span><span class="p">)</span>
     <span class="n">classifyDocument</span><span class="p">(</span><span 
class="n">v</span><span class="p">)</span>
@@ -503,13 +515,13 @@
 </pre></div>
 
 
-<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we 
can simply call our classifyText(...) method on any String</h2>
+<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we 
can simply call our classifyText(...) method on any String<a class="headerlink" 
href="#now-we-can-simply-call-our-classifytext-method-on-any-string" 
title="Permanent link">&para;</a></h2>
 <div class="codehilite"><pre><span class="n">classifyText</span><span 
class="p">(</span>&quot;<span class="n">Hello</span> <span 
class="n">world</span> <span class="n">from</span> <span 
class="n">Queens</span>&quot;<span class="p">)</span>
 <span class="n">classifyText</span><span class="p">(</span>&quot;<span 
class="n">Hello</span> <span class="n">world</span> <span class="n">from</span> 
<span class="n">London</span>&quot;<span class="p">)</span>
 </pre></div>
 
 
-<h2 id="model-persistance">Model persistence</h2>
+<h2 id="model-persistance">Model persistence<a class="headerlink" 
href="#model-persistance" title="Permanent link">&para;</a></h2>
 <p>You can save the model to HDFS:</p>
 <div class="codehilite"><pre><span class="n">model</span><span 
class="p">.</span><span class="n">dfsWrite</span><span 
class="p">(</span>&quot;<span class="o">/</span><span 
class="n">path</span><span class="o">/</span><span class="n">to</span><span 
class="o">/</span><span class="n">model</span>&quot;<span class="p">)</span>
 </pre></div>

Modified: 
websites/staging/mahout/trunk/content/users/environment/h2o-internals.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/environment/h2o-internals.html 
(original)
+++ websites/staging/mahout/trunk/content/users/environment/h2o-internals.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,18 +264,29 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="introduction">Introduction</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction" 
title="Permanent link">&para;</a></h1>
 <p>This document provides an overview of how the Mahout Samsara environment is 
implemented over the H2O backend engine. The document is aimed at Mahout 
developers, to give a high level description of the design so that one can 
explore the code inside <code>h2o/</code> with some context.</p>
-<h2 id="h2o-overview">H2O Overview</h2>
+<h2 id="h2o-overview">H2O Overview<a class="headerlink" href="#h2o-overview" 
title="Permanent link">&para;</a></h2>
 <p>H2O is a distributed scalable machine learning system. The internal 
architecture of H2O has a distributed math engine (h2o-core) and a separate 
layer on top for algorithms and UI. The Mahout integration requires only the 
math engine (h2o-core).</p>
-<h2 id="h2o-data-model">H2O Data Model</h2>
+<h2 id="h2o-data-model">H2O Data Model<a class="headerlink" 
href="#h2o-data-model" title="Permanent link">&para;</a></h2>
 <p>The data model of the H2O math engine is a distributed columnar store (of 
primarily numbers, but also strings). A column of numbers is called a Vector, 
which is broken into Chunks (of a few thousand elements). Chunks are 
distributed across the cluster based on a deterministic hash. Therefore, any 
member of the cluster knows where a particular Chunk of a Vector is homed. Each 
Chunk is separately compressed in memory and elements are individually 
decompressed on the fly upon access with purely register operations (thereby 
achieving high memory throughput). An ordered set of similarly partitioned Vecs 
are composed into a Frame. A Frame is therefore a large two dimensional table 
of numbers. All elements of a logical row in the Frame are guaranteed to be 
homed in the same server of the cluster. Generally speaking, H2O works well on 
"tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of 
columns (10s of thousands).</p>
-<h2 id="mahout-drm">Mahout DRM</h2>
+<h2 id="mahout-drm">Mahout DRM<a class="headerlink" href="#mahout-drm" 
title="Permanent link">&para;</a></h2>
 <p>The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a 
large matrix of numbers in-memory in a cluster by distributing logical rows 
among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend 
engines to provide implementations of this API. Examples are the Spark and H2O 
backend engines. Each engine has its own design of mapping the abstract API 
onto its data model and provides implementations for algebraic operators over 
that mapping.</p>
-<h2 id="h2o-environment-engine">H2O Environment Engine</h2>
+<h2 id="h2o-environment-engine">H2O Environment Engine<a class="headerlink" 
href="#h2o-environment-engine" title="Permanent link">&para;</a></h2>
 <p>The H2O backend implements the abstract DRM as an H2O Frame. Each logical 
column in the DRM is an H2O Vector. All elements of a logical DRM row are 
guaranteed to be homed on the same server. A set of rows stored on a server is 
presented as a read-only virtual in-core Matrix (i.e., BlockMatrix) for the 
closure method in the <code>mapBlock(...)</code> API.</p>
 <p>H2O provides a flexible execution framework called <code>MRTask</code>. The 
<code>MRTask</code> framework typically executes over a Frame (or even a 
Vector), supports various types of map() methods, can optionally modify the 
Frame or Vector (though this never happens in the Mahout integration), and 
optionally create a new Vector or set of Vectors (to combine them into a new 
Frame, and consequently a new DRM).</p>
-<h2 id="source-layout">Source Layout</h2>
+<h2 id="source-layout">Source Layout<a class="headerlink" 
href="#source-layout" title="Permanent link">&para;</a></h2>
 <p>Within mahout.git, the top-level directory <code>h2o/</code> holds all the 
source code related to the H2O backend engine. Part of the code (that 
interfaces with the rest of the Mahout components) is in Scala, and part of 
the code (that interfaces with h2o-core and implements algebraic operators) is 
in Java. Here is a brief overview of what functionality can be found where 
within <code>h2o/</code>.</p>
 <p>h2o/ - top level directory containing all H2O related code</p>
 <p>h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical 
operator code for the various DSL algebra</p>

Modified: 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,10 +264,21 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="how-to-create-and-app-using-mahout">How to create and App using 
Mahout</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="how-to-create-and-app-using-mahout">How to create and App using 
Mahout<a class="headerlink" href="#how-to-create-and-app-using-mahout" 
title="Permanent link">&para;</a></h1>
 <p>This is an example of how to create a simple app using Mahout as a Library. 
The source is available on Github in the <a 
href="https://github.com/pferrel/3-input-cooc">3-input-cooc project</a> with 
more explanation about what it does (has to do with collaborative filtering). 
For this tutorial we'll concentrate on the app rather than the data science.</p>
 <p>The app reads in three user-item interaction types and creates indicators 
for them using cooccurrence and cross-cooccurrence. The indicators will be 
written to text files in a format ready for search engine indexing in a 
search-engine-based recommender.</p>
-<h2 id="setup">Setup</h2>
+<h2 id="setup">Setup<a class="headerlink" href="#setup" title="Permanent 
link">&para;</a></h2>
 <p>In order to build and run the CooccurrenceDriver you need to install the 
following:</p>
 <ul>
 <li>Install the Java 7 JDK from Oracle. Mac users look here: <a 
href="http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html">Java
 SE Development Kit 7u72</a>.</li>
@@ -276,7 +288,7 @@
 </ul>
 <p>Why install if you are only using them as a library? Certain binaries and 
scripts are required by the libraries to get information about the environment 
like discovering where jars are located.</p>
 <p>Spark requires a set of jars on the classpath for the client side part of 
an app and another set of jars must be passed to the Spark Context for running 
distributed code. The example should discover all the necessary classes 
automatically.</p>
-<h2 id="application">Application</h2>
+<h2 id="application">Application<a class="headerlink" href="#application" 
title="Permanent link">&para;</a></h2>
 <p>Using Mahout as a library in an application will require a little Scala 
code. Scala has an App trait so we'll create an object, which inherits from 
<code>App</code></p>
 <div class="codehilite"><pre><span class="n">object</span> <span 
class="n">CooccurrenceDriver</span> <span class="n">extends</span> <span 
class="n">App</span> <span class="p">{</span>
 <span class="p">}</span>
@@ -407,7 +419,7 @@ def writeIndicators<span class="p">(</sp
 </pre></div>
 
 
-<h2 id="build">Build</h2>
+<h2 id="build">Build<a class="headerlink" href="#build" title="Permanent 
link">&para;</a></h2>
 <p>Building the examples from project's root folder:</p>
 <div class="codehilite"><pre>$ <span class="n">sbt</span> <span 
class="n">pack</span>
 </pre></div>
@@ -419,7 +431,7 @@ def writeIndicators<span class="p">(</sp
 
 
 <p>The driver will execute in Spark standalone mode and put the data in 
/path/to/3-input-cooc/data/indicators/<em>indicator-type</em></p>
-<h2 id="using-a-debugger">Using a Debugger</h2>
+<h2 id="using-a-debugger">Using a Debugger<a class="headerlink" 
href="#using-a-debugger" title="Permanent link">&para;</a></h2>
 <p>To build and run this example in a debugger like IntelliJ IDEA, install it 
from the IntelliJ site and add the Scala plugin.</p>
 <p>Open IDEA and go to the menu File-&gt;New-&gt;Project from existing 
sources-&gt;SBT-&gt;/path/to/3-input-cooc. This will create an IDEA project 
from <code>build.sbt</code> in the root directory.</p>
 <p>At this point you may create a "Debug Configuration" to run. In the menu 
choose Run-&gt;Edit Configurations. Under "Default" choose "Application". In 
the dialog hit the ellipsis button "..." to the right of "Environment Variables" 
and fill in your versions of JAVA_HOME, SPARK_HOME, and MAHOUT_HOME. In the 
configuration editor under "Use classpath from" choose the root-3-input-cooc 
module. </p>
@@ -427,7 +439,7 @@ def writeIndicators<span class="p">(</sp
 <p>Now choose "Application" in the left pane and hit the plus sign "+". Give 
the config a name and hit the ellipsis button to the right of the "Main class" 
field as shown.</p>
 <p><img alt="image" src="http://mahout.apache.org/images/debug-config-2.png" 
/></p>
 <p>After setting breakpoints you are now ready to debug the configuration. Go 
to the Run-&gt;Debug... menu and pick your configuration. This will execute 
using a local standalone instance of Spark.</p>
-<h2 id="the-mahout-shell">The Mahout Shell</h2>
+<h2 id="the-mahout-shell">The Mahout Shell<a class="headerlink" 
href="#the-mahout-shell" title="Permanent link">&para;</a></h2>
 <p>For small script-like apps you may wish to use the Mahout shell. It is a 
Scala REPL type interactive shell built on the Spark shell with Mahout-Samsara 
extensions.</p>
 <p>To make the CooccurrenceDriver.scala into a script make the following 
changes:</p>
 <ul>

Modified: 
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html 
(original)
+++ 
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,8 +264,19 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h2 
id="mahout-samsaras-in-core-linear-algebra-dsl-reference">Mahout-Samsara's 
In-Core Linear Algebra DSL Reference</h2>
-<h4 id="imports">Imports</h4>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="mahout-samsaras-in-core-linear-algebra-dsl-reference">Mahout-Samsara's 
In-Core Linear Algebra DSL Reference<a class="headerlink" 
href="#mahout-samsaras-in-core-linear-algebra-dsl-reference" title="Permanent 
link">&para;</a></h2>
+<h4 id="imports">Imports<a class="headerlink" href="#imports" title="Permanent 
link">&para;</a></h4>
 <p>The following imports are used to enable Mahout-Samsara's Scala DSL 
bindings for in-core Linear Algebra:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">math</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">scalabindings</span><span 
class="p">.</span><span class="n">_</span>
@@ -272,7 +284,7 @@
 </pre></div>
 
 
-<h4 id="inline-initalization">Inline initalization</h4>
+<h4 id="inline-initalization">Inline initalization<a class="headerlink" 
href="#inline-initalization" title="Permanent link">&para;</a></h4>
 <p>Dense vectors:</p>
 <div class="codehilite"><pre>val densVec1: Vector = (1.0, 1.1, 1.2)
 val denseVec2 = dvec(1, 0, 1,1 ,1,2)
@@ -314,7 +326,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 </pre></div>
 
 
-<h4 id="slicing-and-assigning">Slicing and Assigning</h4>
+<h4 id="slicing-and-assigning">Slicing and Assigning<a class="headerlink" 
href="#slicing-and-assigning" title="Permanent link">&para;</a></h4>
 <p>Getting a vector element:</p>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">d</span> <span class="p">=</span> <span class="n">vec</span><span 
class="p">(</span>5<span class="p">)</span>
 </pre></div>
@@ -388,7 +400,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 </pre></div>
 
 
-<h4 id="blas-like-operations">BLAS-like operations</h4>
+<h4 id="blas-like-operations">BLAS-like operations<a class="headerlink" 
href="#blas-like-operations" title="Permanent link">&para;</a></h4>
 <p>Plus/minus either vector or numeric with assignment or not:</p>
 <div class="codehilite"><pre><span class="n">a</span> <span class="o">+</span> 
<span class="n">b</span>
 <span class="n">a</span> <span class="o">-</span> <span class="n">b</span>
@@ -472,7 +484,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 
 
 <p>will not therefore incur any additional data copying.</p>
-<h4 id="decompositions">Decompositions</h4>
+<h4 id="decompositions">Decompositions<a class="headerlink" 
href="#decompositions" title="Permanent link">&para;</a></h4>
 <p>Matrix decompositions require an additional import:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">math</span><span class="p">.</span><span 
class="n">decompositions</span><span class="p">.</span><span class="n">_</span>
 </pre></div>
@@ -525,7 +537,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 </pre></div>
 
 
-<h4 id="misc">Misc</h4>
+<h4 id="misc">Misc<a class="headerlink" href="#misc" title="Permanent 
link">&para;</a></h4>
 <p>Vector cardinality:</p>
 <div class="codehilite"><pre><span class="n">a</span><span 
class="p">.</span><span class="nb">length</span>
 </pre></div>
@@ -550,7 +562,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 </pre></div>
 
 
-<h4 id="random-matrices">Random Matrices</h4>
+<h4 id="random-matrices">Random Matrices<a class="headerlink" 
href="#random-matrices" title="Permanent link">&para;</a></h4>
 <p><code>\(\mathcal{U}\)</code>(0,1) random matrix view:</p>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">incCoreA</span> <span class="p">=</span> <span 
class="n">Matrices</span><span class="p">.</span><span 
class="n">uniformView</span><span class="p">(</span><span 
class="n">m</span><span class="p">,</span> <span class="n">n</span><span 
class="p">,</span> <span class="n">seed</span><span class="p">)</span>
 </pre></div>
@@ -566,7 +578,7 @@ val sparseVec1 = svec((5 -&gt; 1.0) :: (
 </pre></div>
 
 
-<h4 id="iterators">Iterators</h4>
+<h4 id="iterators">Iterators<a class="headerlink" href="#iterators" 
title="Permanent link">&para;</a></h4>
 <p>Mahout-Math already exposes a number of iterators.  Scala code just needs 
the following imports to enable implicit conversions to Scala iterators.</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">collection</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">JavaConversions</span><span 
class="p">.</span><span class="n">_</span>

svn commit: r985117 [5/6] - in /websites/staging/mahout/trunk/content: ./ developers/ general/ images/ users/algorithms/ users/basics/ users/classification/ users/clustering/ users/dim-reduction/ users/environment/ users/flinkbindings/ users/misc/ user...
