Modified:
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/streaming-k-means.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="streamingkmeans-algorithm"><em>StreamingKMeans</em> algorithm</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="streamingkmeans-algorithm"><em>StreamingKMeans</em> algorithm<a
class="headerlink" href="#streamingkmeans-algorithm" title="Permanent
link">¶</a></h1>
<p>The <em>StreamingKMeans</em> algorithm is a variant of Algorithm 1 from <a
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M.
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large
Datasets">Shindler et al</a> and consists of two steps:</p>
<ol>
<li>Streaming step </li>
@@ -276,9 +288,9 @@ expected number of clusters is <em>k</em
clusters that will be passed on to the BallKMeans step which will further
reduce the
number of clusters down to <em>k</em>. BallKMeans is a randomized Lloyd-type
algorithm that
has been studied in detail, see <a
href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf" title="R.
Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type
Methods for the k-means Problem">Ostrovsky et al</a>.</p>
-<h2 id="streaming-step">Streaming step</h2>
+<h2 id="streaming-step">Streaming step<a class="headerlink"
href="#streaming-step" title="Permanent link">¶</a></h2>
<hr />
-<h3 id="overview">Overview</h3>
+<h3 id="overview">Overview<a class="headerlink" href="#overview"
title="Permanent link">¶</a></h3>
<p>The streaming step is a derivative of the streaming
portion of Algorithm 1 in <a
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M.
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large
Datasets">Shindler et al</a>. The main difference between the two is that
Algorithm 1 of <a
href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989" title="M.
Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large
Datasets">Shindler et al</a> assumes
@@ -290,7 +302,7 @@ In contrast, Mahout implementation does
data stream. Instead, it dynamically re-evaluates the parameters that depend
on the size
of the data stream at runtime as more and more data is processed. In
particular,
the parameter <em>numClusters</em> (defined below) changes its value as the
data is processed. </p>
-<h3 id="parameters">Parameters</h3>
+<h3 id="parameters">Parameters<a class="headerlink" href="#parameters"
title="Permanent link">¶</a></h3>
<ul>
<li><strong>numClusters</strong> (int): Conceptually, <em>numClusters</em>
represents the algorithm's guess at the optimal
number of clusters it is shooting for. In particular, <em>numClusters</em>
will increase at run
@@ -305,7 +317,7 @@ common ratio <em>beta</em> (see below).
<li><strong>clusterLogFactor</strong> (double): a constant parameter such that
<em>clusterLogFactor * log(numProcessedPoints)</em> is the runtime
estimate of the number of clusters to be produced by the streaming step. If the
final number of clusters (that we expect <em>StreamingKMeans</em> to output) is
<em>k</em>, <em>clusterLogFactor</em> can be set to <em>k</em>. </li>
<li><strong>clusterOvershoot</strong> (double): a constant multiplicative
slack factor that slows down the collapsing of clusters. The default value is
2. </li>
</ul>
-<h3 id="algorithm">Algorithm</h3>
+<h3 id="algorithm">Algorithm<a class="headerlink" href="#algorithm"
title="Permanent link">¶</a></h3>
<p>The algorithm processes the data points one by one and makes only one pass through
the data.
The first point from the data stream will form the centroid of the first
cluster (this designation may change as more points are processed). Suppose
there are <em>r</em> clusters at one point and a new point <em>p</em> is being
processed. The new point can either be added to one of the existing <em>r</em>
clusters or become a new cluster. To decide:</p>
<ul>
@@ -317,16 +329,16 @@ The first point from the data stream wil
<p>There will be either <em>r</em> or <em>r+1</em> clusters after processing a
new point.</p>
<p>As the number of clusters increases, it will go over the
<em>clusterOvershoot * numClusters</em> limit (<em>numClusters</em> represents
a recommendation for the number of clusters that the streaming step should aim
for and <em>clusterOvershoot</em> is the slack). To decrease the number of
clusters the existing clusters
are treated as data points and are re-clustered (collapsed). This tends to
make the number of clusters go down. If the number of clusters is still too
high, <em>distanceCutoff</em> is increased.</p>
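<p>The streaming step described above can be sketched roughly as follows. This is a minimal illustration, not Mahout's implementation. The rule for deciding whether a point starts a new cluster is elided from this excerpt, so the sketch assumes the rule from Shindler et al.: a point at distance d from its nearest centroid becomes a new cluster with probability min(d / distanceCutoff, 1) and is merged into that centroid otherwise. The function names, the starting value of distance_cutoff, the value of beta, and the way num_clusters is raised are all hypothetical choices.</p>

```python
import math
import random


def nearest(centroids, p):
    """Return (index, distance) of the centroid closest to point p."""
    best_i, best_d = 0, math.dist(centroids[0][0], p)
    for i in range(1, len(centroids)):
        d = math.dist(centroids[i][0], p)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def add_point(centroids, p, w, cutoff, rng):
    """Add weighted point p: new cluster or merge into the nearest one."""
    if not centroids:
        centroids.append((tuple(p), w))
        return
    i, d = nearest(centroids, p)
    # Assumed rule: far points are more likely to start a new cluster.
    if rng.random() < min(d / cutoff, 1.0):
        centroids.append((tuple(p), w))
    else:
        c, cw = centroids[i]
        total = cw + w
        merged = tuple((ci * cw + pi * w) / total for ci, pi in zip(c, p))
        centroids[i] = (merged, total)


def streaming_step(points, num_clusters, cluster_log_factor,
                   distance_cutoff=1.0, beta=1.3,
                   cluster_overshoot=2.0, seed=0):
    """Single pass over points; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    centroids = []
    for n, p in enumerate(points, start=1):
        add_point(centroids, p, 1.0, distance_cutoff, rng)
        # Assumption: numClusters grows with the runtime estimate.
        num_clusters = max(num_clusters,
                           int(cluster_log_factor * math.log(n)))
        while len(centroids) > cluster_overshoot * num_clusters:
            # Collapse: re-cluster the existing centroids as weighted points.
            old, centroids = centroids, []
            for c, w in old:
                add_point(centroids, c, w, distance_cutoff, rng)
            # Still too many clusters: loosen the cutoff by the ratio beta.
            if len(centroids) > cluster_overshoot * num_clusters:
                distance_cutoff *= beta
    return centroids
```

<p>Merging keeps a weighted mean, so the total weight of the returned centroids equals the number of points processed. That is what lets the BallKMeans step treat the centroids as weighted vectors.</p>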
-<h2 id="ballkmeans-step">BallKMeans step</h2>
+<h2 id="ballkmeans-step">BallKMeans step<a class="headerlink"
href="#ballkmeans-step" title="Permanent link">¶</a></h2>
<hr />
-<h3 id="overview_1">Overview</h3>
+<h3 id="overview_1">Overview<a class="headerlink" href="#overview_1"
title="Permanent link">¶</a></h3>
<p>The algorithm is a Lloyd-type algorithm that takes a set of weighted
vectors and returns k centroids, see <a
href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf" title="R.
Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type
Methods for the k-means Problem">Ostrovsky et al</a> for details. The algorithm
has two stages: </p>
<ol>
<li>Seeding </li>
<li>Ball k-means </li>
</ol>
<p>The seeding stage is an initial guess of where the centroids should be. The
initial guess is improved using the ball k-means stage. </p>
-<h3 id="parameters_1">Parameters</h3>
+<h3 id="parameters_1">Parameters<a class="headerlink" href="#parameters_1"
title="Permanent link">¶</a></h3>
<ul>
<li>
<p><strong>numClusters</strong> (int): the number k of centroids to return.
The algorithm will return exactly this number of centroids.</p>
@@ -350,7 +362,7 @@ are treated as data points and are re-cl
<p><strong>numRuns</strong> (int): This is the number of runs to perform. The
solution of lowest cost is returned. The default is 1 run.</p>
</li>
</ul>
-<h3 id="algorithm_1">Algorithm</h3>
+<h3 id="algorithm_1">Algorithm<a class="headerlink" href="#algorithm_1"
title="Permanent link">¶</a></h3>
<p>The algorithm can be instructed to take multiple independent runs (using
the <em>numRuns</em> parameter) and the algorithm will select the best solution
(i.e., the one with the lowest cost). In practice, one run is sufficient to
find a good solution. </p>
<p>Each run operates as follows: a seeding procedure is used to select k
centroids, and then ball k-means is run iteratively to refine the solution.</p>
<p>The seeding procedure can be set to either 'uniformly at random' or
'k-means++' using the <em>kMeansPlusPlusInit</em> boolean parameter. Seeding with
k-means++ involves more computation but offers better results in practice. </p>
@@ -360,7 +372,7 @@ are treated as data points and are re-cl
<li>The centers of mass of the trimmed clusters (see <em>trimFraction</em>
parameter above) become the new centroids </li>
</ol>
<p>The data may be partitioned into a test set and a training set (see
<em>testProbability</em>). The seeding procedure and ball k-means run on the
training set. The cost is computed on the test set.</p>
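<p>One refinement pass of the ball k-means stage might look like the sketch below. The definition of <em>trimFraction</em> is elided from this excerpt, so the sketch assumes the usual rule: a point counts toward a centroid's center of mass only if its distance to that centroid is at most trimFraction times the distance from the centroid to its closest other centroid. The function name and the tuple-based vector representation are hypothetical.</p>

```python
import math


def ball_kmeans_iteration(points, weights, centroids, trim_fraction=0.9):
    """One pass: assign points, trim outliers, recompute centers of mass."""
    k = len(centroids)
    dim = len(centroids[0])
    # Ball radius per centroid (assumed trimming rule, see lead-in).
    radius = []
    for i in range(k):
        others = [math.dist(centroids[i], centroids[j])
                  for j in range(k) if j != i]
        radius.append(trim_fraction * min(others) if others else math.inf)

    sums = [[0.0] * dim for _ in range(k)]
    mass = [0.0] * k
    for x, w in zip(points, weights):
        # Nearest centroid for this point.
        i = min(range(k), key=lambda j: math.dist(x, centroids[j]))
        # Points outside the ball are ignored for the update.
        if math.dist(x, centroids[i]) <= radius[i]:
            for t in range(dim):
                sums[i][t] += w * x[t]
            mass[i] += w

    # A centroid that attracted no points inside its ball stays where it was.
    return [tuple(s / mass[i] for s in sums[i]) if mass[i] > 0
            else tuple(centroids[i])
            for i in range(k)]
```

<p>Calling this function repeatedly, starting from the seeded centroids, corresponds to the iterative refinement described above. The weights are the cluster weights produced by the streaming step.</p>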
-<h2 id="usage-of-streamingkmeans">Usage of <em>StreamingKMeans</em></h2>
+<h2 id="usage-of-streamingkmeans">Usage of <em>StreamingKMeans</em><a
class="headerlink" href="#usage-of-streamingkmeans" title="Permanent
link">¶</a></h2>
<div class="codehilite"><pre> <span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span
class="n">streamingkmeans</span>
<span class="o">-</span><span class="nb">i</span> <span
class="o"><</span><span class="n">input</span><span class="o">></span>
<span class="o">-</span><span class="n">o</span> <span
class="o"><</span><span class="n">output</span><span class="o">></span>
@@ -387,7 +399,7 @@ are treated as data points and are re-cl
</pre></div>
-<h3 id="details-on-job-specific-options">Details on Job-Specific Options:</h3>
+<h3 id="details-on-job-specific-options">Details on Job-Specific Options:<a
class="headerlink" href="#details-on-job-specific-options" title="Permanent
link">¶</a></h3>
<ul>
<li><code>--input (-i) <input></code>: Path to job input directory.
</li>
<li><code>--output (-o) <output></code>: The directory pathname for
output. </li>
@@ -412,7 +424,7 @@ are treated as data points and are re-cl
<li><code>--startPhase <startPhase></code> First phase to run. </li>
<li><code>--endPhase <endPhase></code> Last phase to run. </li>
</ul>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<ol>
<li><a href="http://nips.cc/Conferences/2011/Program/event.php?ID=2989"
title="M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large
Datasets">M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For
Large Datasets</a></li>
<li><a href="http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf"
title="R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of
Lloyd-Type Methods for the k-means Problem">R. Ostrovsky, Y. Rabani, L.
Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means
Problem</a></li>
Modified:
websites/staging/mahout/trunk/content/users/clustering/viewing-result.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/viewing-result.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/viewing-result.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <ul>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<ul>
<li><a href="#ViewingResult-AlgorithmViewingpages">Algorithm Viewing
pages</a></li>
</ul>
<p>There are various technologies available to view the output of Mahout
algorithms.
* Clusters</p>
<p><a name="ViewingResult-AlgorithmViewingpages"></a></p>
-<h1 id="algorithm-viewing-pages">Algorithm Viewing pages</h1>
+<h1 id="algorithm-viewing-pages">Algorithm Viewing pages<a class="headerlink"
href="#algorithm-viewing-pages" title="Permanent link">¶</a></h1>
<p>{pagetree:root=@self|excerpt=true|expandCollapseAll=true}</p>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/viewing-results.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/viewing-results.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/viewing-results.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,27 +264,38 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="ViewingResults-Intro"></a></p>
-<h1 id="intro">Intro</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="ViewingResults-Intro"></a></p>
+<h1 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h1>
<p>Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
sequence files or other data structures. This page is intended to
demonstrate the various ways one might inspect the outcome of various jobs.
The page is organized by algorithms.</p>
<p><a name="ViewingResults-GeneralUtilities"></a></p>
-<h1 id="general-utilities">General Utilities</h1>
+<h1 id="general-utilities">General Utilities<a class="headerlink"
href="#general-utilities" title="Permanent link">¶</a></h1>
<p><a name="ViewingResults-SequenceFileDumper"></a></p>
-<h2 id="sequence-file-dumper">Sequence File Dumper</h2>
+<h2 id="sequence-file-dumper">Sequence File Dumper<a class="headerlink"
href="#sequence-file-dumper" title="Permanent link">¶</a></h2>
<p><a name="ViewingResults-Clustering"></a></p>
-<h1 id="clustering">Clustering</h1>
+<h1 id="clustering">Clustering<a class="headerlink" href="#clustering"
title="Permanent link">¶</a></h1>
<p><a name="ViewingResults-ClusterDumper"></a></p>
-<h2 id="cluster-dumper">Cluster Dumper</h2>
+<h2 id="cluster-dumper">Cluster Dumper<a class="headerlink"
href="#cluster-dumper" title="Permanent link">¶</a></h2>
<p>Run the following to print out all options:</p>
<div class="codehilite"><pre><span class="n">java</span> <span
class="o">-</span><span class="n">cp</span> "<span
class="o">*</span>" <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">utils</span><span class="p">.</span><span
class="n">clustering</span><span class="p">.</span><span
class="n">ClusterDumper</span> <span class="o">--</span><span
class="n">help</span>
</pre></div>
<p><a name="ViewingResults-Example"></a></p>
-<h3 id="example">Example</h3>
+<h3 id="example">Example<a class="headerlink" href="#example" title="Permanent
link">¶</a></h3>
<div class="codehilite"><pre><span class="n">java</span> <span
class="o">-</span><span class="n">cp</span> "<span
class="o">*</span>" <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">utils</span><span class="p">.</span><span
class="n">clustering</span><span class="p">.</span><span
class="n">ClusterDumper</span> <span class="o">--</span><span
class="n">seqFileDir</span>
</pre></div>
@@ -292,9 +304,9 @@ demonstrate the various ways one might i
--dictionary ./solr-clust-n2/dictionary.txt
--substring 100 --pointsDir ./solr-clust-n2/out/points/</p>
<p><a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a></p>
-<h2 id="cluster-labels-mahout-163">Cluster Labels (MAHOUT-163)</h2>
+<h2 id="cluster-labels-mahout-163">Cluster Labels (MAHOUT-163)<a
class="headerlink" href="#cluster-labels-mahout-163" title="Permanent
link">¶</a></h2>
<p><a name="ViewingResults-Classification"></a></p>
-<h1 id="classification">Classification</h1>
+<h1 id="classification">Classification<a class="headerlink"
href="#classification" title="Permanent link">¶</a></h1>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/visualizing-sample-clusters.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="VisualizingSampleClusters-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="VisualizingSampleClusters-Introduction"></a></p>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h1>
<p>Mahout provides examples that visualize sample clusters created by
our clustering algorithms. Note that the visualization is done by Swing
programs. You have to be in a window system on the same
machine you run these on, or logged in via a remote desktop.</p>
@@ -272,7 +284,7 @@ machine you run these, or logged in via
classes under <em>org.apache.mahout.clustering.display</em> package in
mahout-examples module. The easiest way to achieve this is to <a
href="users/basics/quickstart.html">set up Mahout</a> in your IDE.</p>
<p><a name="VisualizingSampleClusters-Visualizingclusters"></a></p>
-<h1 id="visualizing-clusters">Visualizing clusters</h1>
+<h1 id="visualizing-clusters">Visualizing clusters<a class="headerlink"
href="#visualizing-clusters" title="Permanent link">¶</a></h1>
<p>The following classes in <em>org.apache.mahout.clustering.display</em> can
be run
without parameters to generate a sample data set and run the reference
clustering implementations over them:</p>
Modified:
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
(original)
+++
websites/staging/mahout/trunk/content/users/dim-reduction/dimensional-reduction.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="support-for-dimensional-reduction">Support for dimensional
reduction</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="support-for-dimensional-reduction">Support for dimensional reduction<a
class="headerlink" href="#support-for-dimensional-reduction" title="Permanent
link">¶</a></h1>
<p>Matrix algebra underpins the way many Big Data algorithms and data
structures are composed: full-text search can be viewed as doing matrix
multiplication of the term-document matrix by the query vector (giving a
@@ -307,16 +319,16 @@ course, sparse matrices which don't fit
far as decomposition is concerned. Parallelizable and/or stream-oriented
algorithms are needed.</p>
<p><a name="DimensionalReduction-SingularValueDecomposition"></a></p>
-<h1 id="singular-value-decomposition">Singular Value Decomposition</h1>
+<h1 id="singular-value-decomposition">Singular Value Decomposition<a
class="headerlink" href="#singular-value-decomposition" title="Permanent
link">¶</a></h1>
<p>Mahout currently provides (as of 0.3, the first release with
MAHOUT-180 applied) two scalable implementations of SVD: a stream-oriented
implementation using the Asymmetric Generalized Hebbian Algorithm outlined in
Genevieve Gorrell & Brandyn Webb's paper (<a
href="-http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf.html">Gorrell and
Webb 2005</a>), and a <a
href="http://en.wikipedia.org/wiki/Lanczos_algorithm">Lanczos</a>
implementation, available both single-threaded, in the
o.a.m.math.decomposer.lanczos package (math module), and as a Hadoop map-reduce
(series of) job(s) in the o.a.m.math.hadoop.decomposer package (core module).
Coming soon: stochastic decomposition.</p>
-<p>See also: <a
href="Wikipedia%20-%20SVD">https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition</a></p>
+<p>See also: <a href="Wikipedia -
SVD">https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition</a></p>
<p><a name="DimensionalReduction-Lanczos"></a></p>
-<h2 id="lanczos">Lanczos</h2>
+<h2 id="lanczos">Lanczos<a class="headerlink" href="#lanczos" title="Permanent
link">¶</a></h2>
<p>The Lanczos algorithm is designed for eigen-decomposition, but like any
such algorithm, getting singular vectors out of it is immediate (singular
vectors of matrix A are just the eigenvectors of A^t * A or A * A^t).
@@ -344,7 +356,7 @@ via Lanczos, and then discard the bottom
the largest singular values (which is the case for using Lanczos for
dimensional reduction).</p>
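<p>The link between eigen-decomposition and singular vectors mentioned above follows directly from the definition of the SVD. As a short derivation, writing the transpose A^t as A^{T} and assuming A has the decomposition A = U Σ V^{T} with orthonormal U and V:</p>

```latex
\begin{aligned}
A &= U \Sigma V^{T}, \qquad U^{T}U = I,\quad V^{T}V = I \\
A^{T}A &= V \Sigma^{T} U^{T} U \Sigma V^{T} = V \,(\Sigma^{T}\Sigma)\, V^{T} \\
A A^{T} &= U \Sigma V^{T} V \Sigma^{T} U^{T} = U \,(\Sigma\Sigma^{T})\, U^{T} \\
\Rightarrow\quad A^{T}A\, v_i &= \sigma_i^{2}\, v_i, \qquad
A A^{T} u_i = \sigma_i^{2}\, u_i, \qquad
u_i = \frac{A v_i}{\sigma_i} \quad (\sigma_i > 0)
\end{aligned}
```

<p>So the right singular vectors v_i are eigenvectors of A^{T}A, the left singular vectors u_i are eigenvectors of AA^{T}, and the eigenvalues are the squared singular values.</p>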
<p><a name="DimensionalReduction-ParallelizationStragegy"></a></p>
-<h3 id="parallelization-stragegy">Parallelization Strategy</h3>
+<h3 id="parallelization-stragegy">Parallelization Strategy<a
class="headerlink" href="#parallelization-stragegy" title="Permanent
link">¶</a></h3>
<p>Lanczos is "embarrassingly parallelizable": matrix multiplication of a
matrix by a vector may be carried out row-at-a-time without communication
until at the end, the results of the intermediate matrix-by-vector outputs
@@ -359,7 +371,7 @@ delaying writing to disk until Mapper cl
a Combiner be the same as the Reducer, the bottleneck in accumulation is
nowhere near a single point.</p>
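<p>The row-at-a-time strategy can be illustrated with a small in-memory sketch. It assumes the product being computed is A^t * A * v, which fits the eigenvectors of A^t * A mentioned earlier but is not confirmed by this excerpt. Each partition of rows plays the role of a mapper and the element-wise sum plays the role of the combiner and reducer. The function names are hypothetical and nothing here uses Hadoop.</p>

```python
def partial_times_squared(rows, v):
    """Mapper-side work: sum of (a_i . v) * a_i over this partition's rows."""
    acc = [0.0] * len(v)
    for row in rows:
        dot = sum(a * x for a, x in zip(row, v))
        for j, a in enumerate(row):
            acc[j] += dot * a
    return acc


def combine(partials):
    """Combiner/reducer-side work: element-wise sum of partial vectors."""
    return [sum(col) for col in zip(*partials)]


def times_squared(partitions, v):
    """Compute A^t * A * v where A is the row-wise union of all partitions."""
    return combine(partial_times_squared(rows, v) for rows in partitions)
```

<p>Each partition needs only its own rows and the vector v, so no communication is required until the partial vectors are summed. Because that sum is associative, the same function can serve as both combiner and reducer.</p>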
<p><a name="DimensionalReduction-Mahoutusage"></a></p>
-<h3 id="mahout-usage">Mahout usage</h3>
+<h3 id="mahout-usage">Mahout usage<a class="headerlink" href="#mahout-usage"
title="Permanent link">¶</a></h3>
<p>The Mahout DistributedLanczosSolver is invoked by the
<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
arguments (which can be reproduced by just entering the command with no
@@ -456,7 +468,7 @@ the long form svd invocation:</p>
<p>TODO: also allow exclusion based on improper orthogonality (currently
computed, but not checked against constraints).</p>
<p><a
name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a></p>
-<h4 id="example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce">Example:
SVD of ASF Mail Archives on Amazon Elastic MapReduce</h4>
+<h4 id="example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce">Example:
SVD of ASF Mail Archives on Amazon Elastic MapReduce<a class="headerlink"
href="#example-svd-of-asf-mail-archives-on-amazon-elastic-mapreduce"
title="Permanent link">¶</a></h4>
<p>This section walks you through a complete example of running the Mahout SVD
job on Amazon Elastic MapReduce cluster and then preparing the output to be
used for clustering. This example was developed as part of the effort to
@@ -479,7 +491,7 @@ mailing list, see: <a href="http://searc
<p>Note: Some of this work is due in part to credits donated by the Amazon
Elastic MapReduce team.</p>
<p><a name="DimensionalReduction-1.LaunchEMRCluster"></a></p>
-<h5 id="1-launch-emr-cluster">1. Launch EMR Cluster</h5>
+<h5 id="1-launch-emr-cluster">1. Launch EMR Cluster<a class="headerlink"
href="#1-launch-emr-cluster" title="Permanent link">¶</a></h5>
<p>For a detailed explanation of the steps involved in launching an Amazon
Elastic MapReduce cluster for running Mahout jobs, please read the
"Building Vectors for Large Document Sets" section of <a
href="https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce">Mahout
on Elastic MapReduce</a>
@@ -487,11 +499,11 @@ Elastic MapReduce cluster for running Ma
<p>In the remaining steps below, remember to replace JOB_ID with the Job ID of
your EMR cluster.</p>
<p><a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a></p>
-<h5 id="2-load-mahout-05-jar-into-s3">2. Load Mahout 0.5+ JAR into S3</h5>
+<h5 id="2-load-mahout-05-jar-into-s3">2. Load Mahout 0.5+ JAR into S3<a
class="headerlink" href="#2-load-mahout-05-jar-into-s3" title="Permanent
link">¶</a></h5>
<p>These steps were created with the mahout-0.5-SNAPSHOT because they rely on
the patch for <a
href="https://issues.apache.org/jira/browse/MAHOUT-639">MAHOUT-639</a></p>
<p><a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a></p>
-<h5 id="3-copy-tfidf-vectors-into-hdfs">3. Copy TFIDF Vectors into HDFS</h5>
+<h5 id="3-copy-tfidf-vectors-into-hdfs">3. Copy TFIDF Vectors into HDFS<a
class="headerlink" href="#3-copy-tfidf-vectors-into-hdfs" title="Permanent
link">¶</a></h5>
<p>Before running your SVD job on the vectors, you need to copy them from S3
to your EMR cluster's HDFS.</p>
<div class="codehilite"><pre><span class="n">elastic</span><span
class="o">-</span><span class="n">mapreduce</span> <span
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span
class="p">:</span><span class="o">//</span><span
class="n">elasticmapreduce</span><span class="o">/</span><span
class="n">samples</span><span class="o">/</span><span
class="n">distcp</span><span class="o">/</span><span
class="n">distcp</span><span class="p">.</span><span class="n">jar</span> <span
class="o">\</span>
@@ -502,7 +514,7 @@ to your EMR cluster's HDFS.</p>
<p><a name="DimensionalReduction-4.RuntheSVDJob"></a></p>
-<h5 id="4-run-the-svd-job">4. Run the SVD Job</h5>
+<h5 id="4-run-the-svd-job">4. Run the SVD Job<a class="headerlink"
href="#4-run-the-svd-job" title="Permanent link">¶</a></h5>
<p>Now you're ready to run the SVD job on the vectors stored in HDFS:</p>
<div class="codehilite"><pre><span class="n">elastic</span><span
class="o">-</span><span class="n">mapreduce</span> <span
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span
class="n">examples</span><span class="o">-</span>0<span
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span
class="o">-</span><span class="n">job</span><span class="p">.</span><span
class="n">jar</span> <span class="o">\</span>
<span class="o">--</span><span class="n">main</span><span
class="o">-</span><span class="n">class</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">driver</span><span class="p">.</span><span
class="n">MahoutDriver</span> <span class="o">\</span>
@@ -528,7 +540,7 @@ removes any duplicate eigenvectors cause
overflow and any that don't appear to be "eigen" enough (ie, they don't
satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix</p>
<p><a
name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a></p>
-<h5 id="5-transform-your-tfidf-vectors-into-mahout-matrix">5. Transform your
TFIDF Vectors into Mahout Matrix</h5>
+<h5 id="5-transform-your-tfidf-vectors-into-mahout-matrix">5. Transform your
TFIDF Vectors into Mahout Matrix<a class="headerlink"
href="#5-transform-your-tfidf-vectors-into-mahout-matrix" title="Permanent
link">¶</a></h5>
<p>The tfidf vectors created by the seq2sparse job are
SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
vectors into a matrix form that is a
@@ -558,7 +570,7 @@ your EMR cluster. The job produces the f
<p>where docIndex is the SequenceFile<IntWritable,Text> and matrix is
SequenceFile<IntWritable,VectorWritable>.</p>
<p><a name="DimensionalReduction-6.TransposetheMatrix"></a></p>
-<h5 id="6-transpose-the-matrix">6. Transpose the Matrix</h5>
+<h5 id="6-transpose-the-matrix">6. Transpose the Matrix<a class="headerlink"
href="#6-transpose-the-matrix" title="Permanent link">¶</a></h5>
<p>Our ultimate goal is to multiply the TFIDF vector matrix by our SVD
eigenvectors. For the mathematically inclined, from the rowid job, we now
have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
@@ -598,7 +610,7 @@ numColsZ == numColsX). - Jake Mannix</p>
<p><a name="DimensionalReduction-7.TransposeEigenvectors"></a></p>
-<h5 id="7-transpose-eigenvectors">7. Transpose Eigenvectors</h5>
+<h5 id="7-transpose-eigenvectors">7. Transpose Eigenvectors<a
class="headerlink" href="#7-transpose-eigenvectors" title="Permanent
link">¶</a></h5>
<p>If you followed Jake's explanation in step 6 above, then you know that we
also need to transpose the eigenvectors:</p>
<div class="codehilite"><pre><span class="n">elastic</span><span
class="o">-</span><span class="n">mapreduce</span> <span
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span
class="n">examples</span><span class="o">-</span>0<span
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span
class="o">-</span><span class="n">job</span><span class="p">.</span><span
class="n">jar</span> <span class="o">\</span>
@@ -620,7 +632,7 @@ transposing the matrix you are multiplyi
<p><a name="DimensionalReduction-8.MatrixMultiplication"></a></p>
-<h5 id="8-matrix-multiplication">8. Matrix Multiplication</h5>
+<h5 id="8-matrix-multiplication">8. Matrix Multiplication<a class="headerlink"
href="#8-matrix-multiplication" title="Permanent link">¶</a></h5>
<p>Lastly, we need to multiply the transposed vectors using Mahout's
matrixmult job:</p>
<div class="codehilite"><pre><span class="n">elastic</span><span
class="o">-</span><span class="n">mapreduce</span> <span
class="o">--</span><span class="n">jar</span> <span class="n">s3</span><span
class="p">:</span><span class="o">//</span><span class="n">BUCKET</span><span
class="o">/</span><span class="n">mahout</span><span class="o">-</span><span
class="n">examples</span><span class="o">-</span>0<span
class="p">.</span>5<span class="o">-</span><span class="n">SNAPSHOT</span><span
class="o">-</span><span class="n">job</span><span class="p">.</span><span
class="n">jar</span> <span class="o">\</span>
@@ -643,7 +655,7 @@ matrixmult job:</p>
<p><a name="DimensionalReduction-Resources"></a></p>
-<h1 id="resources">Resources</h1>
+<h1 id="resources">Resources<a class="headerlink" href="#resources"
title="Permanent link">¶</a></h1>
<ul>
<li><a href="http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm">LSA
tutorial</a></li>
<li><a
href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">SVD
tutorial</a></li>
Modified: websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html
(original)
+++ websites/staging/mahout/trunk/content/users/dim-reduction/ssvd.html Fri Apr
8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,10 +264,21 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="stochastic-singular-value-decomposition">Stochastic Singular Value
Decomposition</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="stochastic-singular-value-decomposition">Stochastic Singular Value
Decomposition<a class="headerlink"
href="#stochastic-singular-value-decomposition" title="Permanent
link">¶</a></h1>
<p>The Stochastic SVD method in Mahout produces reduced-rank Singular Value
Decomposition output in its
strict mathematical definition: <code>\(\mathbf{A\approx
U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)</code>.</p>
-<h2 id="the-benefits-over-other-methods-are">The benefits over other methods
are:</h2>
+<h2 id="the-benefits-over-other-methods-are">The benefits over other methods
are:<a class="headerlink" href="#the-benefits-over-other-methods-are"
title="Permanent link">¶</a></h2>
<ul>
<li>
<p>reduced flops required compared to Krylov subspace methods</p>
@@ -284,14 +296,14 @@ strict mathematical definition: <code>\(
<p>As of 0.7 trunk, includes PCA and dimensionality reduction workflow
(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is
greatly appreciated.)</p>
</li>
</ul>
-<h3 id="map-reduce-characteristics">Map-Reduce characteristics:</h3>
+<h3 id="map-reduce-characteristics">Map-Reduce characteristics:<a
class="headerlink" href="#map-reduce-characteristics" title="Permanent
link">¶</a></h3>
<p>SSVD uses at most 3 MR sequential steps (map-only + map-reduce + 2 optional
parallel map-reduce jobs) to produce reduced rank approximation of U, V and S
matrices. Additionally, two more map-reduce steps are added for each power
iteration step if requested.</p>
-<h2 id="potential-drawbacks">Potential drawbacks:</h2>
+<h2 id="potential-drawbacks">Potential drawbacks:<a class="headerlink"
href="#potential-drawbacks" title="Permanent link">¶</a></h2>
<p>potentially less precise (but adding even one power iteration seems to fix
that quite a bit).</p>
-<h2 id="documentation">Documentation</h2>
+<h2 id="documentation">Documentation<a class="headerlink"
href="#documentation" title="Permanent link">¶</a></h2>
<p><a href="ssvd.page/SSVD-CLI.pdf">Overview and Usage</a></p>
<p>Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or
later.</p>
-<h2 id="publications">Publications</h2>
+<h2 id="publications">Publications<a class="headerlink" href="#publications"
title="Permanent link">¶</a></h2>
<p><a
href="http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf">Nathan
Halko's dissertation</a> "Randomized methods for computing low-rank
approximations of matrices" contains a comprehensive definition of the
parallelization strategy taken in the Mahout SSVD implementation and also some
precision/scalability benchmarks, esp. w.r.t. the Mahout Lanczos implementation on
a typical corpus data set.</p>
<p>The <a href="http://arxiv.org/abs/0909.4061">Halko, Martinsson, Tropp</a> paper
discusses a family of random projection-based algorithms and contains theoretical
error estimates.</p>
@@ -318,7 +330,7 @@ x<span class="o"><-</span> usim <span
<p>and try to compare ssvd.svd(x) and stock svd(x) performance for the same
rank k; notice the difference in the running time. Also play with power
iterations (qIter) and compare accuracies of standard svd and SSVD.</p>
<p>Note: numerical stability of R algorithms may differ from that of Mahout's
distributed version. We haven't studied the accuracy of the R simulation. For a study
of the accuracy of Mahout's version, please refer to Nathan's dissertation as
referenced above.</p>
-<h4 id="modified-ssvd-algorithm">Modified SSVD Algorithm.</h4>
+<h4 id="modified-ssvd-algorithm">Modified SSVD Algorithm.<a class="headerlink"
href="#modified-ssvd-algorithm" title="Permanent link">¶</a></h4>
<p>Given an <code>\(m\times n\)</code>
matrix <code>\(\mathbf{A}\)</code>, a target rank
<code>\(k\in\mathbb{N}_{1}\)</code>
, an oversampling parameter <code>\(p\in\mathbb{N}_{1}\)</code>,
Modified:
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,11 +264,22 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="building-a-text-classifier-in-mahouts-spark-shell">Building a text
classifier in Mahout's Spark Shell</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="building-a-text-classifier-in-mahouts-spark-shell">Building a text
classifier in Mahout's Spark Shell<a class="headerlink"
href="#building-a-text-classifier-in-mahouts-spark-shell" title="Permanent
link">¶</a></h1>
<p>This tutorial will take you through the steps used to train a Multinomial
Naive Bayes model and create a text classifier based on that model using the
<code>mahout spark-shell</code>. </p>
-<h2 id="prerequisites">Prerequisites</h2>
+<h2 id="prerequisites">Prerequisites<a class="headerlink"
href="#prerequisites" title="Permanent link">¶</a></h2>
<p>This tutorial assumes that you have your Spark environment variables set
for the <code>mahout spark-shell</code>; see <a
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
with Mahout's Shell</a>. We also assume that Mahout is running in cluster
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable
<strong>unset</strong>) as we'll be reading and writing to HDFS.</p>
-<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and
Vectorizing the Wikipedia dataset</h2>
+<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and
Vectorizing the Wikipedia dataset<a class="headerlink"
href="#downloading-and-vectorizing-the-wikipedia-dataset" title="Permanent
link">¶</a></h2>
<p><em>As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions
of <code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract
and vectorize our text. A</em> <a
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark
implementation of seq2sparse</em></a> <em>is in the works for Mahout v.
0.11.</em> However, to download the Wikipedia dataset, extract the bodies of
the documents, label each document and vectorize the text into TF-IDF
vectors, we can simply run the <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a>
example. </p>
<div class="codehilite"><pre><span class="n">Please</span> <span
class="n">select</span> <span class="n">a</span> <span class="n">number</span>
<span class="n">to</span> <span class="n">choose</span> <span
class="n">the</span> <span class="n">corresponding</span> <span
class="n">task</span> <span class="n">to</span> <span class="n">run</span>
1<span class="p">.</span> <span class="n">CBayes</span> <span
class="p">(</span><span class="n">may</span> <span class="n">require</span>
<span class="n">increased</span> <span class="n">heap</span> <span
class="n">space</span> <span class="n">on</span> <span
class="n">yarn</span><span class="p">)</span>
@@ -278,14 +290,14 @@
<p>Enter (2). This will download a large recent XML dump of the Wikipedia
database, into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and
place it into HDFS. It will run a <a
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
job to parse the wikipedia set</a>, extracting and labeling only pages with
category tags for [United States] and [United Kingdom] (~11600 documents). It
will then run <code>mahout seq2sparse</code> to convert the documents into
TF-IDF vectors. The script will also build and test a <a
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes
model using MapReduce</a>. When it is completed, you should see a confusion
matrix on your screen. For this tutorial, we will ignore the MapReduce model,
and build a new model using Spark based on the vectorized text output by
<code>seq2sparse</code>.</p>
-<h2 id="getting-started">Getting Started</h2>
+<h2 id="getting-started">Getting Started<a class="headerlink"
href="#getting-started" title="Permanent link">¶</a></h2>
<p>Launch the <code>mahout spark-shell</code>. There is an example script:
<code>spark-document-classifier.mscala</code> (.mscala denotes a Mahout-Scala
script which can be run similarly to an R script). We will be walking through
this script for this tutorial but if you wanted to simply run the script, you
could just issue the command: </p>
<div class="codehilite"><pre><span class="n">mahout</span><span
class="o">></span> <span class="p">:</span><span class="n">load</span> <span
class="o">/</span><span class="n">path</span><span class="o">/</span><span
class="n">to</span><span class="o">/</span><span class="n">mahout</span><span
class="o">/</span><span class="n">examples</span><span class="o">/</span><span
class="n">bin</span><span class="o">/</span><span class="n">spark</span><span
class="o">-</span><span class="n">document</span><span class="o">-</span><span
class="n">classifier</span><span class="p">.</span><span class="n">mscala</span>
</pre></div>
<p>For now, let's take the script apart piece by piece. You can cut and paste
the following code blocks into the <code>mahout spark-shell</code>.</p>
-<h2 id="imports">Imports</h2>
+<h2 id="imports">Imports<a class="headerlink" href="#imports" title="Permanent
link">¶</a></h2>
<p>Our Mahout Naive Bayes imports:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">naivebayes</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">stats</span><span class="p">.</span><span class="n">_</span>
@@ -300,19 +312,19 @@
</pre></div>
-<h2
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
in our full set from HDFS as vectorized by seq2sparse in
classify-wikipedia.sh</h2>
+<h2
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
in our full set from HDFS as vectorized by seq2sparse in
classify-wikipedia.sh<a class="headerlink"
href="#read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">pathToData</span> <span class="p">=</span> "<span
class="o">/</span><span class="n">tmp</span><span class="o">/</span><span
class="n">mahout</span><span class="o">-</span><span class="n">work</span><span
class="o">-</span><span class="n">wiki</span><span class="o">/</span>"
<span class="n">val</span> <span class="n">fullData</span> <span
class="p">=</span> <span class="n">drmDfsRead</span><span
class="p">(</span><span class="n">pathToData</span> <span class="o">+</span>
"<span class="n">wikipediaVecs</span><span class="o">/</span><span
class="n">tfidf</span><span class="o">-</span><span
class="n">vectors</span>"<span class="p">)</span>
</pre></div>
-<h2
id="extract-the-category-of-each-observation-and-aggregate-those-observations-by-category">Extract
the category of each observation and aggregate those observations by
category</h2>
+<h2
id="extract-the-category-of-each-observation-and-aggregate-those-observations-by-category">Extract
the category of each observation and aggregate those observations by
category<a class="headerlink"
href="#extract-the-category-of-each-observation-and-aggregate-those-observations-by-category"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span>
<span class="n">aggregatedObservations</span><span class="p">)</span> <span
class="p">=</span> <span class="n">SparkNaiveBayes</span><span
class="p">.</span><span
class="n">extractLabelsAndAggregateObservations</span><span class="p">(</span>
<span
class="n">fullData</span><span class="p">)</span>
</pre></div>
-<h2
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
a Multinomial Naive Bayes model and self test on the training set</h2>
+<h2
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
a Multinomial Naive Bayes model and self test on the training set<a
class="headerlink"
href="#build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">model</span> <span class="p">=</span> <span
class="n">SparkNaiveBayes</span><span class="p">.</span><span
class="n">train</span><span class="p">(</span><span
class="n">aggregatedObservations</span><span class="p">,</span> <span
class="n">labelIndex</span><span class="p">,</span> <span
class="n">false</span><span class="p">)</span>
<span class="n">val</span> <span class="n">resAnalyzer</span> <span
class="p">=</span> <span class="n">SparkNaiveBayes</span><span
class="p">.</span><span class="n">test</span><span class="p">(</span><span
class="n">model</span><span class="p">,</span> <span
class="n">fullData</span><span class="p">,</span> <span
class="n">false</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span
class="n">resAnalyzer</span><span class="p">)</span>
@@ -320,7 +332,7 @@
<p>Printing the <code>ResultAnalyzer</code> will display the confusion
matrix.</p>
-<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in
the dictionary and document frequency count from HDFS</h2>
+<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in
the dictionary and document frequency count from HDFS<a class="headerlink"
href="#read-in-the-dictionary-and-document-frequency-count-from-hdfs"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">dictionary</span> <span class="p">=</span> <span
class="n">sdc</span><span class="p">.</span><span
class="n">sequenceFile</span><span class="p">(</span><span
class="n">pathToData</span> <span class="o">+</span> "<span
class="n">wikipediaVecs</span><span class="o">/</span><span
class="n">dictionary</span><span class="p">.</span><span
class="n">file</span><span class="o">-</span>0"<span class="p">,</span>
<span class="n">classOf</span><span
class="p">[</span><span class="n">Text</span><span class="p">],</span>
<span class="n">classOf</span><span
class="p">[</span><span class="n">IntWritable</span><span class="p">])</span>
@@ -344,7 +356,7 @@
</pre></div>
-<h2
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
a function to tokenize and vectorize new text using our current dictionary</h2>
+<h2
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
a function to tokenize and vectorize new text using our current dictionary<a
class="headerlink"
href="#define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary"
title="Permanent link">¶</a></h2>
<p>For this simple example, our function <code>vectorizeDocument(...)</code>
will tokenize a new document into unigrams using native Java String methods and
vectorize using our dictionary and document frequencies. You could also use a
<a href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams,
trigrams, etc., and integrate Apache <a
href="https://tika.apache.org/">Tika</a> to extract text from different
document types (PDF, PPT, XLS, etc.). Here, however, we will keep it simple,
stripping and tokenizing our text using regexes and native String methods.</p>
<div class="codehilite"><pre>def vectorizeDocument<span
class="p">(</span>document: String<span class="p">,</span>
dictionaryMap: Map<span class="p">[</span>String<span
class="p">,</span>Int<span class="p">],</span>
@@ -376,7 +388,7 @@
</pre></div>
-<h2 id="setup-our-classifier">Set up our classifier</h2>
+<h2 id="setup-our-classifier">Set up our classifier<a class="headerlink"
href="#setup-our-classifier" title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">labelMap</span> <span class="p">=</span> <span
class="n">model</span><span class="p">.</span><span class="n">labelIndex</span>
<span class="n">val</span> <span class="n">numLabels</span> <span
class="p">=</span> <span class="n">model</span><span class="p">.</span><span
class="n">numLabels</span>
<span class="n">val</span> <span class="n">reverseLabelMap</span> <span
class="p">=</span> <span class="n">labelMap</span><span class="p">.</span><span
class="n">map</span><span class="p">(</span><span class="n">x</span> <span
class="p">=</span><span class="o">></span> <span class="n">x</span><span
class="p">.</span><span class="n">_2</span> <span class="o">-></span> <span
class="n">x</span><span class="p">.</span><span class="n">_1</span><span
class="p">)</span>
@@ -389,7 +401,7 @@
</pre></div>
-<h2 id="define-an-argmax-function">Define an argmax function</h2>
+<h2 id="define-an-argmax-function">Define an argmax function<a
class="headerlink" href="#define-an-argmax-function" title="Permanent
link">¶</a></h2>
<p>The label with the highest score wins the classification for a given
document.</p>
<div class="codehilite"><pre>def argmax<span class="p">(</span>v: Vector<span
class="p">)</span>: <span class="p">(</span>Int<span class="p">,</span>
Double<span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
var bestIdx: Int <span class="o">=</span> Integer.MIN_VALUE
@@ -405,7 +417,7 @@
</pre></div>
-<h2 id="define-our-tf-idf-vector-classifier">Define our TF(-IDF) vector
classifier</h2>
+<h2 id="define-our-tf-idf-vector-classifier">Define our TF(-IDF) vector
classifier<a class="headerlink" href="#define-our-tf-idf-vector-classifier"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">classifyDocument</span><span class="p">(</span><span
class="n">clvec</span><span class="p">:</span> <span
class="n">Vector</span><span class="p">)</span> <span class="p">:</span> <span
class="n">String</span> <span class="p">=</span> <span class="p">{</span>
<span class="n">val</span> <span class="n">cvec</span> <span
class="p">=</span> <span class="n">classifier</span><span
class="p">.</span><span class="n">classifyFull</span><span
class="p">(</span><span class="n">clvec</span><span class="p">)</span>
<span class="n">val</span> <span class="p">(</span><span
class="n">bestIdx</span><span class="p">,</span> <span
class="n">bestScore</span><span class="p">)</span> <span class="p">=</span>
<span class="n">argmax</span><span class="p">(</span><span
class="n">cvec</span><span class="p">)</span>
@@ -414,7 +426,7 @@
</pre></div>
-<h2
id="two-sample-news-articles-united-states-football-and-united-kingdom-football">Two
sample news articles: United States Football and United Kingdom Football</h2>
+<h2
id="two-sample-news-articles-united-states-football-and-united-kingdom-football">Two
sample news articles: United States Football and United Kingdom Football<a
class="headerlink"
href="#two-sample-news-articles-united-states-football-and-united-kingdom-football"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="c1">// A random United States
football article</span>
<span class="c1">//
http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128</span>
<span class="n">val</span> <span class="n">UStextToClassify</span> <span
class="o">=</span> <span class="k">new</span> <span
class="n">String</span><span class="p">(</span><span class="s">"(Reuters)
- Super Bowl security officials acknowledge"</span> <span
class="o">+</span>
@@ -483,7 +495,7 @@
</pre></div>
-<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our
documents</h2>
+<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our
documents<a class="headerlink" href="#vectorize-and-classify-our-documents"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">usVec</span> <span class="p">=</span> <span
class="n">vectorizeDocument</span><span class="p">(</span><span
class="n">UStextToClassify</span><span class="p">,</span> <span
class="n">dictionaryMap</span><span class="p">,</span> <span
class="n">dfCountMap</span><span class="p">)</span>
<span class="n">val</span> <span class="n">ukVec</span> <span
class="p">=</span> <span class="n">vectorizeDocument</span><span
class="p">(</span><span class="n">UKtextToClassify</span><span
class="p">,</span> <span class="n">dictionaryMap</span><span class="p">,</span>
<span class="n">dfCountMap</span><span class="p">)</span>
@@ -495,7 +507,7 @@
</pre></div>
-<h2 id="tie-everything-together-in-a-new-method-to-classify-text">Tie
everything together in a new method to classify text</h2>
+<h2 id="tie-everything-together-in-a-new-method-to-classify-text">Tie
everything together in a new method to classify text<a class="headerlink"
href="#tie-everything-together-in-a-new-method-to-classify-text"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">classifyText</span><span class="p">(</span><span
class="n">txt</span><span class="p">:</span> <span class="n">String</span><span
class="p">):</span> <span class="n">String</span> <span class="p">=</span>
<span class="p">{</span>
<span class="n">val</span> <span class="n">v</span> <span
class="p">=</span> <span class="n">vectorizeDocument</span><span
class="p">(</span><span class="n">txt</span><span class="p">,</span> <span
class="n">dictionaryMap</span><span class="p">,</span> <span
class="n">dfCountMap</span><span class="p">)</span>
<span class="n">classifyDocument</span><span class="p">(</span><span
class="n">v</span><span class="p">)</span>
@@ -503,13 +515,13 @@
</pre></div>
-<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we
can simply call our classifyText(...) method on any String</h2>
+<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we
can simply call our classifyText(...) method on any String<a class="headerlink"
href="#now-we-can-simply-call-our-classifytext-method-on-any-string"
title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">classifyText</span><span
class="p">(</span>"<span class="n">Hello</span> <span
class="n">world</span> <span class="n">from</span> <span
class="n">Queens</span>"<span class="p">)</span>
<span class="n">classifyText</span><span class="p">(</span>"<span
class="n">Hello</span> <span class="n">world</span> <span class="n">from</span>
<span class="n">London</span>"<span class="p">)</span>
</pre></div>
-<h2 id="model-persistance">Model persistence</h2>
+<h2 id="model-persistance">Model persistence<a class="headerlink"
href="#model-persistance" title="Permanent link">¶</a></h2>
<p>You can save the model to HDFS:</p>
<div class="codehilite"><pre><span class="n">model</span><span
class="p">.</span><span class="n">dfsWrite</span><span
class="p">(</span>"<span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">model</span>"<span class="p">)</span>
</pre></div>
Modified:
websites/staging/mahout/trunk/content/users/environment/h2o-internals.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/environment/h2o-internals.html
(original)
+++ websites/staging/mahout/trunk/content/users/environment/h2o-internals.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,18 +264,29 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="introduction">Introduction</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h1>
<p>This document provides an overview of how the Mahout Samsara environment is
implemented over the H2O backend engine. The document is aimed at Mahout
developers, to give a high level description of the design so that one can
explore the code inside <code>h2o/</code> with some context.</p>
-<h2 id="h2o-overview">H2O Overview</h2>
+<h2 id="h2o-overview">H2O Overview<a class="headerlink" href="#h2o-overview"
title="Permanent link">¶</a></h2>
<p>H2O is a distributed scalable machine learning system. The internal
architecture of H2O has a distributed math engine (h2o-core) and a separate
layer on top for algorithms and UI. The Mahout integration requires only the
math engine (h2o-core).</p>
-<h2 id="h2o-data-model">H2O Data Model</h2>
+<h2 id="h2o-data-model">H2O Data Model<a class="headerlink"
href="#h2o-data-model" title="Permanent link">¶</a></h2>
<p>The data model of the H2O math engine is a distributed columnar store (of
primarily numbers, but also strings). A column of numbers is called a Vector,
which is broken into Chunks (of a few thousand elements). Chunks are
distributed across the cluster based on a deterministic hash. Therefore, any
member of the cluster knows where a particular Chunk of a Vector is homed. Each
Chunk is separately compressed in memory and elements are individually
decompressed on the fly upon access with purely register operations (thereby
achieving high memory throughput). An ordered set of similarly partitioned Vecs
are composed into a Frame. A Frame is therefore a large two dimensional table
of numbers. All elements of a logical row in the Frame are guaranteed to be
homed in the same server of the cluster. Generally speaking, H2O works well on
"tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of
columns (10s of thousands).</p>
-<h2 id="mahout-drm">Mahout DRM</h2>
+<h2 id="mahout-drm">Mahout DRM<a class="headerlink" href="#mahout-drm"
title="Permanent link">¶</a></h2>
<p>The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a
large matrix of numbers in-memory in a cluster by distributing logical rows
among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend
engines to provide implementations of this API. Examples are the Spark and H2O
backend engines. Each engine has its own design of mapping the abstract API
onto its data model and provides implementations for algebraic operators over
that mapping.</p>
-<h2 id="h2o-environment-engine">H2O Environment Engine</h2>
+<h2 id="h2o-environment-engine">H2O Environment Engine<a class="headerlink"
href="#h2o-environment-engine" title="Permanent link">¶</a></h2>
<p>The H2O backend implements the abstract DRM as an H2O Frame. Each logical
column in the DRM is an H2O Vector. All elements of a logical DRM row are
guaranteed to be homed on the same server. The set of rows stored on a server
is presented as a read-only virtual in-core Matrix (i.e., BlockMatrix) to the
closure method in the <code>mapBlock(...)</code> API.</p>
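<p>The block-wise closure pattern can be sketched without any Mahout
dependency. In this hypothetical stand-in, <code>MapBlockSketch.mapBlock</code>
and the <code>Block</code> type are illustrative names only; the real
<code>mapBlock(...)</code> operates on DRMs and Mahout matrices, and its exact
signature is not reproduced here.</p>

```scala
// Hypothetical sketch of a block-wise map; not the Mahout mapBlock signature.
object MapBlockSketch {
  // The rows held by one server, viewed as a small in-core matrix.
  type Block = Array[Array[Double]]

  // Apply the closure to each (row keys, block) pair independently.
  // The closure treats its input as read-only and returns a new block.
  def mapBlock(parts: Seq[(Array[Int], Block)])(
      f: (Array[Int], Block) => (Array[Int], Block)): Seq[(Array[Int], Block)] =
    parts.map { case (keys, block) => f(keys, block) }
}

// Usage: add 1.0 to every element without mutating the input block.
object MapBlockUsage {
  val input: Seq[(Array[Int], MapBlockSketch.Block)] =
    Seq((Array(0, 1), Array(Array(1.0, 2.0), Array(3.0, 4.0))))

  val output: Seq[(Array[Int], MapBlockSketch.Block)] =
    MapBlockSketch.mapBlock(input) { (keys, block) =>
      (keys, block.map(row => row.map(_ + 1.0)))
    }
}
```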
<p>H2O provides a flexible execution framework called <code>MRTask</code>. The
<code>MRTask</code> framework typically executes over a Frame (or even a
Vector), supports various types of map() methods, can optionally modify the
Frame or Vector (though this never happens in the Mahout integration), and can
optionally create a new Vector or set of Vectors (to combine them into a new
Frame, and consequently a new DRM).</p>
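<p>The map-then-reduce shape of such a task can be sketched in plain Scala.
This is a hypothetical stand-in, not H2O's <code>MRTask</code> class: the
<code>Task</code> trait, <code>run</code>, and <code>SumTask</code> are names
invented for illustration, and chunks are modelled as plain arrays.</p>

```scala
// Hypothetical sketch of the map/reduce pattern; NOT H2O's MRTask API.
object MRTaskSketch {
  trait Task[R] {
    // Runs once per chunk, on the node where that chunk is homed.
    def map(chunk: Array[Double]): R
    // Combines two partial results; should be associative.
    def reduce(a: R, b: R): R
  }

  // Sequential driver standing in for distributed execution.
  // Assumes at least one chunk; an empty input would throw.
  def run[R](chunks: Seq[Array[Double]], task: Task[R]): R =
    chunks.map(c => task.map(c)).reduce((a, b) => task.reduce(a, b))

  // Example task: the sum of all elements in a column.
  object SumTask extends Task[Double] {
    def map(chunk: Array[Double]): Double = chunk.sum
    def reduce(a: Double, b: Double): Double = a + b
  }
}
```

<p>A task that builds new columns would return the new chunk data from
<code>map</code> instead of a scalar, which mirrors how a new Frame (and so a
new DRM) is produced rather than modifying the input.</p>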
-<h2 id="source-layout">Source Layout</h2>
+<h2 id="source-layout">Source Layout<a class="headerlink"
href="#source-layout" title="Permanent link">¶</a></h2>
<p>Within mahout.git, the top-level directory <code>h2o/</code> holds all the
source code related to the H2O backend engine. Part of the code (that
interfaces with the rest of the Mahout components) is in Scala, and part of
the code (that interfaces with h2o-core and implements algebraic operators) is
in Java. Here is a brief overview of what functionality can be found where
within <code>h2o/</code>.</p>
<p>h2o/ - top level directory containing all H2O related code</p>
<p>h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical
operator code for the various DSL algebraic operations</p>
Modified:
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,10 +264,21 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="how-to-create-and-app-using-mahout">How to create an App using
Mahout</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="how-to-create-and-app-using-mahout">How to create an App using
Mahout<a class="headerlink" href="#how-to-create-and-app-using-mahout"
title="Permanent link">¶</a></h1>
<p>This is an example of how to create a simple app using Mahout as a library.
The source is available on GitHub in the <a
href="https://github.com/pferrel/3-input-cooc">3-input-cooc project</a>, along
with more explanation of what it does (it relates to collaborative filtering).
For this tutorial we'll concentrate on the app rather than the data science.</p>
<p>The app reads in three user-item interaction types and creates indicators
for them using cooccurrence and cross-cooccurrence. The indicators will be
written to text files in a format ready for search engine indexing in a
search-engine-based recommender.</p>
-<h2 id="setup">Setup</h2>
+<h2 id="setup">Setup<a class="headerlink" href="#setup" title="Permanent
link">¶</a></h2>
<p>In order to build and run the CooccurrenceDriver you need to install the
following:</p>
<ul>
<li>Install the Java 7 JDK from Oracle. Mac users look here: <a
href="http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html">Java
SE Development Kit 7u72</a>.</li>
@@ -276,7 +288,7 @@
</ul>
<p>Why install if you are only using them as a library? Certain binaries and
scripts are required by the libraries to get information about the environment,
such as discovering where jars are located.</p>
<p>Spark requires a set of jars on the classpath for the client-side part of
an app, and another set of jars must be passed to the Spark Context for running
distributed code. The example should discover all the necessary classes
automatically.</p>
-<h2 id="application">Application</h2>
+<h2 id="application">Application<a class="headerlink" href="#application"
title="Permanent link">¶</a></h2>
<p>Using Mahout as a library in an application will require a little Scala
code. Scala has an <code>App</code> trait, so we'll create an object that
inherits from <code>App</code>:</p>
<div class="codehilite"><pre><span class="n">object</span> <span
class="n">CooccurrenceDriver</span> <span class="n">extends</span> <span
class="n">App</span> <span class="p">{</span>
<span class="p">}</span>
@@ -407,7 +419,7 @@ def writeIndicators<span class="p">(</sp
</pre></div>
-<h2 id="build">Build</h2>
+<h2 id="build">Build<a class="headerlink" href="#build" title="Permanent
link">¶</a></h2>
<p>Building the examples from project's root folder:</p>
<div class="codehilite"><pre>$ <span class="n">sbt</span> <span
class="n">pack</span>
</pre></div>
@@ -419,7 +431,7 @@ def writeIndicators<span class="p">(</sp
<p>The driver will execute in Spark standalone mode and put the data in
/path/to/3-input-cooc/data/indicators/<em>indicator-type</em></p>
-<h2 id="using-a-debugger">Using a Debugger</h2>
+<h2 id="using-a-debugger">Using a Debugger<a class="headerlink"
href="#using-a-debugger" title="Permanent link">¶</a></h2>
<p>To build and run this example in a debugger such as IntelliJ IDEA, install
IDEA from the IntelliJ site and add the Scala plugin.</p>
<p>Open IDEA and go to the menu File->New->Project from existing
sources->SBT->/path/to/3-input-cooc. This will create an IDEA project
from <code>build.sbt</code> in the root directory.</p>
<p>At this point you may create a "Debug Configuration" to run. In the menu
choose Run->Edit Configurations. Under "Default" choose "Application". In
the dialog hit the ellipsis button "..." to the right of "Environment Variables"
and fill in your versions of JAVA_HOME, SPARK_HOME, and MAHOUT_HOME. In the
configuration editor under "Use classpath from" choose the root-3-input-cooc
module. </p>
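<p>If a debug run fails early, it can help to confirm that the three
environment variables actually reached the JVM. The following is a minimal
sketch; <code>EnvCheck</code> and <code>missing</code> are hypothetical helper
names and are not part of the 3-input-cooc project.</p>

```scala
// Hypothetical helper, not part of the example project.
object EnvCheck {
  // The variables the debug configuration is expected to provide.
  val required: Seq[String] = Seq("JAVA_HOME", "SPARK_HOME", "MAHOUT_HOME")

  // Return the names of required variables that are absent or empty.
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.nonEmpty))
}

// Possible usage at the top of a driver:
//   val m = EnvCheck.missing(sys.env)
//   require(m.isEmpty, s"Missing environment variables: ${m.mkString(", ")}")
```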
@@ -427,7 +439,7 @@ def writeIndicators<span class="p">(</sp
<p>Now choose "Application" in the left pane and hit the plus sign "+". Give
the config a name and hit the ellipsis button to the right of the "Main class"
field as shown.</p>
<p><img alt="image" src="http://mahout.apache.org/images/debug-config-2.png"
/></p>
<p>After setting breakpoints you are now ready to debug the configuration. Go
to the Run->Debug... menu and pick your configuration. This will execute
using a local standalone instance of Spark.</p>
-<h2 id="the-mahout-shell">The Mahout Shell</h2>
+<h2 id="the-mahout-shell">The Mahout Shell<a class="headerlink"
href="#the-mahout-shell" title="Permanent link">¶</a></h2>
<p>For small script-like apps you may wish to use the Mahout shell. It is a
Scala REPL-style interactive shell built on the Spark shell with Mahout-Samsara
extensions.</p>
<p>To turn CooccurrenceDriver.scala into a script, make the following
changes:</p>
<ul>
Modified:
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/in-core-reference.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h2
id="mahout-samsaras-in-core-linear-algebra-dsl-reference">Mahout-Samsara's
In-Core Linear Algebra DSL Reference</h2>
-<h4 id="imports">Imports</h4>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="mahout-samsaras-in-core-linear-algebra-dsl-reference">Mahout-Samsara's
In-Core Linear Algebra DSL Reference<a class="headerlink"
href="#mahout-samsaras-in-core-linear-algebra-dsl-reference" title="Permanent
link">¶</a></h2>
+<h4 id="imports">Imports<a class="headerlink" href="#imports" title="Permanent
link">¶</a></h4>
<p>The following imports are used to enable Mahout-Samsara's Scala DSL
bindings for in-core Linear Algebra:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">math</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">scalabindings</span><span
class="p">.</span><span class="n">_</span>
@@ -272,7 +284,7 @@
</pre></div>
-<h4 id="inline-initalization">Inline initialization</h4>
+<h4 id="inline-initalization">Inline initialization<a class="headerlink"
href="#inline-initalization" title="Permanent link">¶</a></h4>
<p>Dense vectors:</p>
<div class="codehilite"><pre>val denseVec1: Vector = (1.0, 1.1, 1.2)
val denseVec2 = dvec(1, 0, 1, 1, 1, 2)
@@ -314,7 +326,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
</pre></div>
-<h4 id="slicing-and-assigning">Slicing and Assigning</h4>
+<h4 id="slicing-and-assigning">Slicing and Assigning<a class="headerlink"
href="#slicing-and-assigning" title="Permanent link">¶</a></h4>
<p>Getting a vector element:</p>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">d</span> <span class="p">=</span> <span class="n">vec</span><span
class="p">(</span>5<span class="p">)</span>
</pre></div>
@@ -388,7 +400,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
</pre></div>
-<h4 id="blas-like-operations">BLAS-like operations</h4>
+<h4 id="blas-like-operations">BLAS-like operations<a class="headerlink"
href="#blas-like-operations" title="Permanent link">¶</a></h4>
<p>Plus/minus either vector or numeric with assignment or not:</p>
<div class="codehilite"><pre><span class="n">a</span> <span class="o">+</span>
<span class="n">b</span>
<span class="n">a</span> <span class="o">-</span> <span class="n">b</span>
@@ -472,7 +484,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
<p>will not therefore incur any additional data copying.</p>
-<h4 id="decompositions">Decompositions</h4>
+<h4 id="decompositions">Decompositions<a class="headerlink"
href="#decompositions" title="Permanent link">¶</a></h4>
<p>Matrix decompositions require an additional import:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">math</span><span class="p">.</span><span
class="n">decompositions</span><span class="p">.</span><span class="n">_</span>
</pre></div>
@@ -525,7 +537,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
</pre></div>
-<h4 id="misc">Misc</h4>
+<h4 id="misc">Misc<a class="headerlink" href="#misc" title="Permanent
link">¶</a></h4>
<p>Vector cardinality:</p>
<div class="codehilite"><pre><span class="n">a</span><span
class="p">.</span><span class="nb">length</span>
</pre></div>
@@ -550,7 +562,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
</pre></div>
-<h4 id="random-matrices">Random Matrices</h4>
+<h4 id="random-matrices">Random Matrices<a class="headerlink"
href="#random-matrices" title="Permanent link">¶</a></h4>
<p><code>\(\mathcal{U}\)</code>(0,1) random matrix view:</p>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">incCoreA</span> <span class="p">=</span> <span
class="n">Matrices</span><span class="p">.</span><span
class="n">uniformView</span><span class="p">(</span><span
class="n">m</span><span class="p">,</span> <span class="n">n</span><span
class="p">,</span> <span class="n">seed</span><span class="p">)</span>
</pre></div>
@@ -566,7 +578,7 @@ val sparseVec1 = svec((5 -> 1.0) :: (
</pre></div>
-<h4 id="iterators">Iterators</h4>
+<h4 id="iterators">Iterators<a class="headerlink" href="#iterators"
title="Permanent link">¶</a></h4>
<p>Mahout-Math already exposes a number of iterators. Scala code just needs
the following imports to enable implicit conversions to Scala iterators.</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">collection</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">JavaConversions</span><span
class="p">.</span><span class="n">_</span>