Modified:
websites/staging/mahout/trunk/content/users/environment/out-of-core-reference.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/out-of-core-reference.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/out-of-core-reference.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,11 +264,22 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1
id="mahout-samsaras-distributed-linear-algebra-dsl-reference">Mahout-Samsara's
Distributed Linear Algebra DSL Reference</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1
id="mahout-samsaras-distributed-linear-algebra-dsl-reference">Mahout-Samsara's
Distributed Linear Algebra DSL Reference<a class="headerlink"
href="#mahout-samsaras-distributed-linear-algebra-dsl-reference"
title="Permanent link">¶</a></h1>
<p><strong>Note: this page is meant only as a quick reference to
Mahout-Samsara's R-Like DSL semantics. For more information, including
information on Mahout-Samsara's Algebraic Optimizer please see: <a
href="http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf">Mahout
Scala Bindings and Mahout Spark Bindings for Linear Algebra
Subroutines</a>.</strong></p>
<p>The subjects of this reference are solely applicable to Mahout-Samsara's
<strong>DRM</strong> (distributed row matrix).</p>
<p>In this reference, DRMs will be denoted as e.g. <code>A</code>, and in-core
matrices as e.g. <code>inCoreA</code>.</p>
-<h4 id="imports">Imports</h4>
+<h4 id="imports">Imports<a class="headerlink" href="#imports" title="Permanent
link">¶</a></h4>
<p>The following imports are used to enable seamless in-core and distributed
algebraic DSL operations:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">math</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">scalabindings</span><span
class="p">.</span><span class="n">_</span>
@@ -289,7 +301,7 @@
<p>The Mahout shell does all of these imports automatically.</p>
-<h4 id="drm-persistence-operators">DRM Persistence operators</h4>
+<h4 id="drm-persistence-operators">DRM Persistence operators<a
class="headerlink" href="#drm-persistence-operators" title="Permanent
link">¶</a></h4>
<p><strong>Mahout-Samsara's DRM persistence to HDFS is compatible with all
Mahout-MapReduce algorithms such as seq2sparse.</strong></p>
<p>Loading a DRM from (HD)FS:</p>
<div class="codehilite"><pre><span class="n">drmDfsRead</span><span
class="p">(</span><span class="n">path</span> <span class="p">=</span> <span
class="n">hdfsPath</span><span class="p">)</span>
@@ -325,7 +337,7 @@ val inCoreC: Matrix = inCoreA %*%: drmB
</pre></div>
-<h4 id="logical-algebraic-operators-on-drm-matrices">Logical algebraic
operators on DRM matrices:</h4>
+<h4 id="logical-algebraic-operators-on-drm-matrices">Logical algebraic
operators on DRM matrices:<a class="headerlink"
href="#logical-algebraic-operators-on-drm-matrices" title="Permanent
link">¶</a></h4>
<p>A logical set of operators is defined for distributed matrices as a subset
of those defined for in-core matrices. In particular, since all distributed
matrices are immutable, there are no assignment operators (e.g. <strong>A +=
B</strong>).
<em>Note: please see: <a
href="http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf">Mahout
Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines</a>
for information on Mahout-Samsara's Algebraic Optimizer, and translation from
logical operations to a physical plan for the back end.</em></p>
<p>Cache a DRM and trigger an optimized physical plan: </p>
@@ -420,7 +432,7 @@ Elementwise operations of every matrix e
<p>Note that <code>5.0 -: A</code> means <code>\(m_{ij} = 5 - a_{ij}\)</code>
and <code>5.0 /: A</code> means <code>\(m_{ij} = \frac{5}{a_{ij}}\)</code> for
all elements of the result.</p>
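<p>The element-wise semantics above can be checked on a small in-memory example. The following is only a plain-Python sketch of the arithmetic, not the Mahout DSL; the helper names <code>scalar_minus</code> and <code>scalar_div</code> are hypothetical and stand in for <code>5.0 -: A</code> and <code>5.0 /: A</code>.</p>

```python
def scalar_minus(c, a):
    """Illustrates `c -: A`: every result element is m_ij = c - a_ij."""
    return [[c - x for x in row] for row in a]


def scalar_div(c, a):
    """Illustrates `c /: A`: every result element is m_ij = c / a_ij."""
    return [[c / x for x in row] for row in a]


# A small dense matrix represented as nested lists (rows of floats).
A = [[1.0, 2.0], [4.0, 5.0]]

print(scalar_minus(5.0, A))  # [[4.0, 3.0], [1.0, 0.0]]
print(scalar_div(5.0, A))    # [[5.0, 2.5], [1.25, 1.0]]
```

<p>Note that the scalar is the left operand in both cases, which is why the results differ from <code>A - 5.0</code> and <code>A / 5.0</code>.</p>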
-<h4 id="slicing">Slicing</h4>
+<h4 id="slicing">Slicing<a class="headerlink" href="#slicing" title="Permanent
link">¶</a></h4>
<p>General slice:</p>
<div class="codehilite"><pre><span class="n">A</span><span
class="p">(</span>100 <span class="n">to</span> 200<span class="p">,</span> 100
<span class="n">to</span> 200<span class="p">)</span>
</pre></div>
@@ -437,7 +449,7 @@ Elementwise operations of every matrix e
<p><em>Note: if the row range is not the all-range (::), the DRM must be
<code>Int</code>-keyed. General-case row slicing is not supported by DRMs with
key types other than <code>Int</code></em>.</p>
-<h4 id="stitching">Stitching</h4>
+<h4 id="stitching">Stitching<a class="headerlink" href="#stitching"
title="Permanent link">¶</a></h4>
<p>Stitch side by side (cbind R semantics):</p>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">drmAnextToB</span> <span class="p">=</span> <span
class="n">drmA</span> <span class="n">cbind</span> <span class="n">drmB</span>
</pre></div>
@@ -449,7 +461,7 @@ Elementwise operations of every matrix e
<p>Analogously, vertical concatenation is available via
<strong>rbind</strong></p>
-<h4 id="custom-pipelines-on-blocks">Custom pipelines on blocks</h4>
+<h4 id="custom-pipelines-on-blocks">Custom pipelines on blocks<a
class="headerlink" href="#custom-pipelines-on-blocks" title="Permanent
link">¶</a></h4>
<p>Internally, Mahout-Samsara's DRM is represented as a distributed set of
vertical (Key, Block) tuples.</p>
<p><strong>drm.mapBlock(...)</strong>:</p>
<p>The DRM operator <code>mapBlock</code> provides transformational access to
the distributed vertical blockified tuples of a matrix (Row-Keys,
Vertical-Matrix-Block).</p>
@@ -462,7 +474,7 @@ Elementwise operations of every matrix e
</pre></div>
-<h4 id="broadcasting-vectors-and-matrices-to-closures">Broadcasting Vectors
and matrices to closures</h4>
+<h4 id="broadcasting-vectors-and-matrices-to-closures">Broadcasting Vectors
and matrices to closures<a class="headerlink"
href="#broadcasting-vectors-and-matrices-to-closures" title="Permanent
link">¶</a></h4>
<p>Generally we can create and use one-way closure attributes to be used on
the back end.</p>
<p>Scalar matrix multiplication:</p>
<div class="codehilite"><pre>val factor: Int = 15
@@ -484,7 +496,7 @@ val drm2 <span class="o">=</span> drm1.m
</pre></div>
-<h4 id="computations-providing-ad-hoc-summaries">Computations providing ad-hoc
summaries</h4>
+<h4 id="computations-providing-ad-hoc-summaries">Computations providing ad-hoc
summaries<a class="headerlink" href="#computations-providing-ad-hoc-summaries"
title="Permanent link">¶</a></h4>
<p>Matrix cardinality:</p>
<div class="codehilite"><pre><span class="n">drmA</span><span
class="p">.</span><span class="n">nrow</span>
<span class="n">drmA</span><span class="p">.</span><span class="n">ncol</span>
@@ -501,7 +513,7 @@ val drm2 <span class="o">=</span> drm1.m
<p><em>Note: These will always trigger a computational action. I.e. if one
calls <code>colSums()</code> n times, then the back end will actually recompute
<code>colSums</code> n times.</em></p>
-<h4 id="distributed-matrix-decompositions">Distributed Matrix
Decompositions</h4>
+<h4 id="distributed-matrix-decompositions">Distributed Matrix Decompositions<a
class="headerlink" href="#distributed-matrix-decompositions" title="Permanent
link">¶</a></h4>
<p>To import the decomposition package:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">math</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">decompositions</span><span
class="p">.</span><span class="n">_</span>
@@ -532,7 +544,7 @@ val drm2 <span class="o">=</span> drm1.m
</pre></div>
-<h4 id="adjusting-parallelism-of-computations">Adjusting parallelism of
computations</h4>
+<h4 id="adjusting-parallelism-of-computations">Adjusting parallelism of
computations<a class="headerlink" href="#adjusting-parallelism-of-computations"
title="Permanent link">¶</a></h4>
<p>Set the minimum parallelism to 100 for computations on
<code>drmA</code>:</p>
<div class="codehilite"><pre><span class="n">drmA</span><span
class="p">.</span><span class="n">par</span><span class="p">(</span><span
class="n">min</span> <span class="p">=</span> 100<span class="p">)</span>
</pre></div>
@@ -548,7 +560,7 @@ val drm2 <span class="o">=</span> drm1.m
</pre></div>
-<h4
id="retrieving-the-engine-specific-data-structure-backing-the-drm">Retrieving
the engine specific data structure backing the DRM:</h4>
+<h4
id="retrieving-the-engine-specific-data-structure-backing-the-drm">Retrieving
the engine specific data structure backing the DRM:<a class="headerlink"
href="#retrieving-the-engine-specific-data-structure-backing-the-drm"
title="Permanent link">¶</a></h4>
<p><strong>A Spark RDD:</strong></p>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">myRDD</span> <span class="p">=</span> <span
class="n">drmA</span><span class="p">.</span><span
class="n">checkpoint</span><span class="p">().</span><span class="n">rdd</span>
</pre></div>
Modified:
websites/staging/mahout/trunk/content/users/environment/spark-internals.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/spark-internals.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/spark-internals.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="introduction">Introduction</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h1>
<p>This document provides an overview of how the Mahout Scala DSL (distributed
algebraic operators) is implemented over the Spark back end engine. The
document is aimed at Mahout developers, to give a high level description of the
design. </p>
-<h2 id="spark-overview">Spark Overview</h2>
-<h2 id="spark-data-model">Spark Data Model</h2>
-<h2 id="mahout-drm">Mahout DRM</h2>
+<h2 id="spark-overview">Spark Overview<a class="headerlink"
href="#spark-overview" title="Permanent link">¶</a></h2>
+<h2 id="spark-data-model">Spark Data Model<a class="headerlink"
href="#spark-data-model" title="Permanent link">¶</a></h2>
+<h2 id="mahout-drm">Mahout DRM<a class="headerlink" href="#mahout-drm"
title="Permanent link">¶</a></h2>
<p>Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a
large matrix of numbers in-memory in a cluster by distributing logical rows
among servers. The DSL provides an abstract API on DRMs for backend engines to
provide implementations of this API. Examples are Spark and H2O backend
engines. Each engine has its own design for mapping the abstract API onto its
data model and provides implementations of the algebraic operators over that
mapping.</p>
-<h2 id="spark-dsl-engine">Spark DSL Engine</h2>
-<h2 id="source-layout">Source Layout</h2>
+<h2 id="spark-dsl-engine">Spark DSL Engine<a class="headerlink"
href="#spark-dsl-engine" title="Permanent link">¶</a></h2>
+<h2 id="source-layout">Source Layout<a class="headerlink"
href="#source-layout" title="Permanent link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/flinkbindings/flink-internals.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/flinkbindings/flink-internals.html
(original)
+++
websites/staging/mahout/trunk/content/users/flinkbindings/flink-internals.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified: websites/staging/mahout/trunk/content/users/misc/mr---map-reduce.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/mr---map-reduce.html
(original)
+++ websites/staging/mahout/trunk/content/users/misc/mr---map-reduce.html Fri
Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>{excerpt}MapReduce is a framework for processing huge datasets on
certain
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>{excerpt}MapReduce is a framework for processing huge datasets on certain
kinds of distributable problems using a large number of computers (nodes),
collectively referred to as a cluster.{excerpt} Computational processing
can occur on data stored either in a filesystem (unstructured) or within a
Modified:
websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html
(original)
+++
websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>Mahout has a Top K Parallel FPGrowth Implementation. It is based on the
paper <a
href="http://infolab.stanford.edu/~echang/recsys08-69.pdf">http://infolab.stanford.edu/~echang/recsys08-69.pdf</a>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>Mahout has a Top K Parallel FPGrowth Implementation. It is based on the paper
<a
href="http://infolab.stanford.edu/~echang/recsys08-69.pdf">http://infolab.stanford.edu/~echang/recsys08-69.pdf</a>
with some optimisations in mining the data.</p>
<p>Given a huge transaction list, the algorithm finds all unique features (sets
of field values) and eliminates those features whose frequency in the whole
@@ -311,7 +323,7 @@ class which takes care of storing the ob
File Output format</li>
</ul>
<p><a
name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a></p>
-<h2 id="running-frequent-pattern-growth-via-command-line">Running Frequent
Pattern Growth via command line</h2>
+<h2 id="running-frequent-pattern-growth-via-command-line">Running Frequent
Pattern Growth via command line<a class="headerlink"
href="#running-frequent-pattern-growth-via-command-line" title="Permanent
link">¶</a></h2>
<p>The command line launcher for string transaction data
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver has other features including
specifying the regex pattern for splitting a string line of a transaction
@@ -319,7 +331,7 @@ into the constituent features.</p>
<p>Input files have to be in the following format.</p>
<p>&lt;optional document id&gt;TAB&lt;TOKEN1&gt;SPACE&lt;TOKEN2&gt;SPACE....</p>
<p>Instead of tab you could use , or \| as the default tokenization is done
using a java Regex pattern {code}<a href=",\t.html">,\t</a>
-<em>[,|\t][ ,\t]</em>{code}
+<em code="code">[,|\t][ ,\t]</em>
You can override this parameter to parse your log files or transaction
files (each line is a transaction.) The FPGrowth algorithm mines the top K
frequently occurring sets of items and their counts from the given input
@@ -350,7 +362,7 @@ gz file or even a directory containing a
We modified the regex to use space to split the token. Note that input
regex string is escaped.</p>
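<p>The tokenization step described above can be sketched in Python. This is an illustration only, not the FPGrowthDriver code: the default pattern is rendered garbled on this page, so the value of <code>SPLITTER</code> below is an assumption about what it was meant to be, and <code>parse_transaction</code> is a hypothetical helper.</p>

```python
import re

# ASSUMPTION: intended default splitter (a comma, pipe or tab, optionally
# surrounded by spaces, commas or tabs). The page renders it garbled.
SPLITTER = re.compile(r"[ ,\t]*[,|\t][ ,\t]*")


def parse_transaction(line, splitter=SPLITTER):
    """Split one transaction line into its non-empty feature tokens."""
    return [tok for tok in splitter.split(line.strip()) if tok]


# Comma, tab and pipe all act as separators with the assumed default.
print(parse_transaction("milk, bread\teggs|butter"))
# ['milk', 'bread', 'eggs', 'butter']

# Overriding the pattern so that whitespace separates the tokens instead.
print(parse_transaction("milk bread eggs", re.compile(r"\s+")))
# ['milk', 'bread', 'eggs']
```

<p>Each line is treated as one transaction, so the override only changes how a line is cut into items, not how lines are grouped.</p>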
<p><a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a></p>
-<h2 id="running-parallel-fpgrowth">Running Parallel FPGrowth</h2>
+<h2 id="running-parallel-fpgrowth">Running Parallel FPGrowth<a
class="headerlink" href="#running-parallel-fpgrowth" title="Permanent
link">¶</a></h2>
<p>Running parallel FPGrowth is as easy as changing the flag to -method
mapreduce and adding the number of groups parameter, e.g. -g 20 for 20
groups. First, let's run the above sample test in map-reduce mode:</p>
@@ -417,7 +429,7 @@ consumption but might improve speed unti
entirely on the dataset in question. A value of 5-10 is recommended for
mining up to top 100 patterns for each feature.</p>
<p><a name="ParallelFrequentPatternMining-Viewingtheresults"></a></p>
-<h2 id="viewing-the-results">Viewing the results</h2>
+<h2 id="viewing-the-results">Viewing the results<a class="headerlink"
href="#viewing-the-results" title="Permanent link">¶</a></h2>
<p>The output will be dumped to a SequenceFile in the frequentpatterns
directory in Text=>TopKStringPatterns format. Run this command to see a few
of the Frequent Patterns:</p>
Modified:
websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html
(original)
+++ websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a></p>
-<h1 id="classification-with-perceptron-or-winnow">Classification with
Perceptron or Winnow</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a></p>
+<h1 id="classification-with-perceptron-or-winnow">Classification with
Perceptron or Winnow<a class="headerlink"
href="#classification-with-perceptron-or-winnow" title="Permanent
link">¶</a></h1>
<p>Both algorithms are comparatively simple linear classifiers. Given training
data in some n-dimensional vector space that is annotated with binary
labels, the algorithms are guaranteed to find a linear separating hyperplane
@@ -280,12 +292,12 @@ In contrast to Naive Bayes they are not
features (in the domain of text classification: all terms in a document)
are independent.</p>
<p><a name="PerceptronandWinnow-Strategyforparallelisation"></a></p>
-<h2 id="strategy-for-parallelisation">Strategy for parallelisation</h2>
+<h2 id="strategy-for-parallelisation">Strategy for parallelisation<a
class="headerlink" href="#strategy-for-parallelisation" title="Permanent
link">¶</a></h2>
<p>Currently the strategy for parallelisation is simple: Given there is enough
training data, split the training data. Train the classifier on each split.
The resulting hyperplanes are then averaged.</p>
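<p>The split, train and average strategy above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the code from the patch: the function names, the toy data and the use of the textbook perceptron update rule are all assumptions made for the example.</p>

```python
def train_perceptron(samples, epochs=10):
    """Textbook perceptron; returns weights with the bias as last element.

    `samples` is a list of (features, label) pairs with label in {-1, +1}.
    """
    dim = len(samples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in samples:
            xb = list(x) + [1.0]  # append constant bias input
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:    # misclassified: move hyperplane toward x
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w


def split(samples, n):
    """Deal the samples round-robin into n splits."""
    return [samples[i::n] for i in range(n)]


def average(weight_vectors):
    """Average the hyperplanes component by component."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]


def predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1


# Hypothetical, linearly separable toy data.
data = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((2.5, 3.0), 1), ((4.0, 1.5), 1),
        ((-2.0, -1.0), -1), ((-3.0, -2.0), -1), ((-2.5, -3.0), -1),
        ((-4.0, -1.5), -1)]

# Train one classifier per split, then average the resulting hyperplanes.
w_avg = average([train_perceptron(part) for part in split(data, 2)])
print(w_avg)  # [2.5, 1.5, 1.0]
print(all(predict(w_avg, x) == y for x, y in data))  # True
```

<p>In a distributed setting each split would be trained on a separate worker and only the weight vectors would be collected for averaging.</p>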
<p><a name="PerceptronandWinnow-Roadmap"></a></p>
-<h2 id="roadmap">Roadmap</h2>
+<h2 id="roadmap">Roadmap<a class="headerlink" href="#roadmap" title="Permanent
link">¶</a></h2>
<p>Currently the patch only contains the code for the classifier itself. It is
planned to provide unit tests and at least one example based on the WebKB
dataset by the end of November for the serial version. After that the
Modified: websites/staging/mahout/trunk/content/users/misc/testing.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/testing.html (original)
+++ websites/staging/mahout/trunk/content/users/misc/testing.html Fri Apr 8
18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,12 +264,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="Testing-Intro"></a></p>
-<h1 id="intro">Intro</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="Testing-Intro"></a></p>
+<h1 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h1>
<p>As Mahout matures, solid testing procedures are needed. This page and its
children capture test plans along with ideas for improving our testing.</p>
<p><a name="Testing-TestPlans"></a></p>
-<h1 id="test-plans">Test Plans</h1>
+<h1 id="test-plans">Test Plans<a class="headerlink" href="#test-plans"
title="Permanent link">¶</a></h1>
<ul>
<li><a href="0.6.html">0.6</a></li>
<li>Test Plans for the 0.6 release
@@ -276,9 +288,9 @@ There are no special plans except for un
Hadoop jobs.</li>
</ul>
<p><a name="Testing-TestIdeas"></a></p>
-<h1 id="test-ideas">Test Ideas</h1>
+<h1 id="test-ideas">Test Ideas<a class="headerlink" href="#test-ideas"
title="Permanent link">¶</a></h1>
<p><a name="Testing-Regressions/Benchmarks/Integrations"></a></p>
-<h2
id="regressionsbenchmarksintegrations">Regressions/Benchmarks/Integrations</h2>
+<h2
id="regressionsbenchmarksintegrations">Regressions/Benchmarks/Integrations<a
class="headerlink" href="#regressionsbenchmarksintegrations" title="Permanent
link">¶</a></h2>
<ul>
<li>Algorithmic quality and speed are not tested, except in a few instances.
Such tests often require much longer run times (minutes to hours), a
@@ -290,14 +302,14 @@ S3, JDBC, Cassandra, etc. </li>
<p>Apache Jenkins is not able to support these environments. Commercial
donations would help. </p>
<p><a name="Testing-UnitTests"></a></p>
-<h2 id="unit-tests">Unit Tests</h2>
+<h2 id="unit-tests">Unit Tests<a class="headerlink" href="#unit-tests"
title="Permanent link">¶</a></h2>
<p>Mahout's current tests are almost entirely unit tests. Algorithm tests
generally supply a few numbers to code paths and verify that expected
numbers come out. 'mvn test' runs these tests. There is "positive" coverage
of a great many utilities and algorithms. A much smaller percentage includes
"negative" coverage (bogus setups, inputs, combinations).</p>
<p><a name="Testing-Other"></a></p>
-<h2 id="other">Other</h2>
+<h2 id="other">Other<a class="headerlink" href="#other" title="Permanent
link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/misc/using-mahout-with-python-via-jpype.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/misc/using-mahout-with-python-via-jpype.html
(original)
+++
websites/staging/mahout/trunk/content/users/misc/using-mahout-with-python-via-jpype.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="UsingMahoutwithPythonviaJPype-overview"></a></p>
-<h1 id="mahout-over-jython-some-examples">Mahout over Jython - some
examples</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="UsingMahoutwithPythonviaJPype-overview"></a></p>
+<h1 id="mahout-over-jython-some-examples">Mahout over Jython - some examples<a
class="headerlink" href="#mahout-over-jython-some-examples" title="Permanent
link">¶</a></h1>
<p>This tutorial provides some sample code illustrating how we can read and
write sequence files containing Mahout vectors from Python using JPype.
This tutorial is intended for people who want to use Python for analyzing
@@ -299,7 +311,7 @@ python script. The result for me looks l
<p><a
name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a></p>
-<h1 id="writing-named-vectors-to-sequence-files-from-python">Writing Named
Vectors to Sequence Files from Python</h1>
+<h1 id="writing-named-vectors-to-sequence-files-from-python">Writing Named
Vectors to Sequence Files from Python<a class="headerlink"
href="#writing-named-vectors-to-sequence-files-from-python" title="Permanent
link">¶</a></h1>
<p>We can now use JPype to create sequence files which will contain vectors to
be used by Mahout for kmeans. The example below is a function which creates
vectors from two Gaussian distributions with unit variance.</p>
@@ -370,7 +382,7 @@ vectors from two Gaussian distributions
<p><a
name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a></p>
-<h1 id="reading-the-kmeans-clustered-points-from-python">Reading the KMeans
Clustered Points from Python</h1>
+<h1 id="reading-the-kmeans-clustered-points-from-python">Reading the KMeans
Clustered Points from Python<a class="headerlink"
href="#reading-the-kmeans-clustered-points-from-python" title="Permanent
link">¶</a></h1>
<p>Similarly, we can use JPype to easily read the clustered points output by
Mahout.</p>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">read_clustered_pts</span><span class="p">(</span><span
class="n">ifile</span><span class="p">,</span><span class="o">*</span><span
class="n">args</span><span class="p">,</span><span class="o">**</span><span
class="n">param</span><span class="p">):</span>
@@ -420,7 +432,7 @@ mahout.</p>
<p><a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a></p>
-<h1 id="reading-the-kmeans-centroids">Reading the KMeans Centroids</h1>
+<h1 id="reading-the-kmeans-centroids">Reading the KMeans Centroids<a
class="headerlink" href="#reading-the-kmeans-centroids" title="Permanent
link">¶</a></h1>
<p>Finally, we can create a function to print out the actual cluster centers
found by Mahout:</p>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">getClusters</span><span class="p">(</span><span
class="n">ifile</span><span class="p">,</span><span class="o">*</span><span
class="n">args</span><span class="p">,</span><span class="o">**</span><span
class="n">param</span><span class="p">):</span>
Modified:
websites/staging/mahout/trunk/content/users/recommender/intro-als-hadoop.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/intro-als-hadoop.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/intro-als-hadoop.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified:
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified:
websites/staging/mahout/trunk/content/users/recommender/intro-itembased-hadoop.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/intro-itembased-hadoop.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/intro-itembased-hadoop.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1
id="introduction-to-item-based-recommendations-with-hadoop">Introduction to
Item-Based Recommendations with Hadoop</h1>
-<h2 id="overview">Overview</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="introduction-to-item-based-recommendations-with-hadoop">Introduction
to Item-Based Recommendations with Hadoop<a class="headerlink"
href="#introduction-to-item-based-recommendations-with-hadoop" title="Permanent
link">¶</a></h1>
+<h2 id="overview">Overview<a class="headerlink" href="#overview"
title="Permanent link">¶</a></h2>
<p>Mahout's item based recommender is a flexible and easily implemented
algorithm with a diverse range of applications. The minimalism of the primary
input file's structure and availability of ancillary filtering controls can
make sourcing required data and shaping a desired output both efficient and
straightforward.</p>
<p>Typical use cases include:</p>
<ul>
@@ -282,7 +294,7 @@
<li>Map product substitutions into the Mahout input (e.g. if WidgetA is a
recommended item, replace it with WidgetX)</li>
</ul>
<p>The item based recommender output can be easily consumed by downstream
applications (e.g. websites, ERP systems or salesforce automation tools) and is
configurable so users can determine the number of item recommendations
generated by the algorithm.</p>
-<h2 id="example">Example</h2>
+<h2 id="example">Example<a class="headerlink" href="#example" title="Permanent
link">¶</a></h2>
<p>Testing the item based recommender can be a simple and potentially quite
rewarding endeavor. Whereas the typical sample use case for collaborative
filtering focuses on utilization of, and integration with, eCommerce platforms
we can instead look at a potential use case applicable to most businesses (even
those without a web presence). Let's look at how a company might use
Mahout's item based recommender to identify new sales opportunities for an
existing customer base. First, you'll need to get Mahout up and running, the
instructions for which can be found <a
href="https://mahout.apache.org/users/basics/quickstart.html">here</a>. After
you've ensured Mahout is properly installed, we're ready to run a quick
example.</p>
<p><strong>Step 1: Gather some test data</strong></p>
<p>Mahout's item based recommender relies on three key pieces of data:
<em>userID</em>, <em>itemID</em> and <em>preference</em>. The "users" could
be website visitors or simply customers that purchase products from your
business. Similarly, items could be products, product groups or even pages on
your website - really anything you would want to recommend to a group of
users or customers. For our example let's use customer orders as a proxy for
preference. A simple count of distinct orders by customer, by product will work
for this example. You'll find as you explore ways to manipulate the item
based recommender the preference value can be many things (page clicks,
explicit ratings, order counts, etc.). Once your test data is gathered put it
in a <em>.txt</em> file separated by commas with no column headers included.</p>
Modified:
websites/staging/mahout/trunk/content/users/recommender/matrix-factorization.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/matrix-factorization.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/matrix-factorization.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="MatrixFactorization-Intro"></a></p>
-<h1
id="introduction-to-matrix-factorization-for-recommendation-mining">Introduction
to Matrix Factorization for Recommendation Mining</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="MatrixFactorization-Intro"></a></p>
+<h1
id="introduction-to-matrix-factorization-for-recommendation-mining">Introduction
to Matrix Factorization for Recommendation Mining<a class="headerlink"
href="#introduction-to-matrix-factorization-for-recommendation-mining"
title="Permanent link">¶</a></h1>
<p>In the mathematical discipline of linear algebra, a matrix decomposition
or matrix factorization is a dimensionality reduction technique that
factorizes a matrix into a product of matrices, usually two.
There are many different matrix decompositions, each finds use among a
particular class of problems.</p>
@@ -297,7 +309,7 @@ So our matrix factorization target could
</pre></div>
-<h2 id="sgd">SGD</h2>
+<h2 id="sgd">SGD<a class="headerlink" href="#sgd" title="Permanent
link">¶</a></h2>
<p>Stochastic gradient descent is a gradient descent optimization method for
minimizing an objective function that is written as a sum of differentiable
functions.</p>
<div class="codehilite"><pre> <span class="n">Q</span><span
class="p">(</span><span class="n">w</span><span class="p">)</span> <span
class="p">=</span> <span class="n">sum</span><span class="p">(</span><span
class="n">Q_i</span><span class="p">(</span><span class="n">w</span><span
class="p">)),</span>
</pre></div>
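<p>As a minimal sketch of this idea (illustrative Python, not Mahout code), the
function below minimizes Q(w) = sum(Q_i(w)) for the assumed choice
Q_i(w) = (w - x_i)^2, updating w with the gradient of one randomly chosen term
at a time. The name <code>sgd_minimize</code>, the sample data and the decaying
learning rate are assumptions made for the example.</p>

```python
import random


def sgd_minimize(xs, epochs=200, seed=0):
    """Minimize Q(w) = sum((w - x_i)^2) by stochastic gradient descent.

    Each update uses the gradient of a single term Q_i instead of the
    gradient of the full sum.
    """
    rng = random.Random(seed)
    w = 0.0
    step = 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the terms in a fresh random order
        for i in order:
            step += 1
            alpha = 1.0 / (2.0 * step)  # assumed decaying learning rate
            grad = 2.0 * (w - xs[i])    # gradient of Q_i(w) = (w - x_i)^2
            w -= alpha * grad
    return w


if __name__ == "__main__":
    # The exact minimizer of this objective is the mean of the samples.
    print(sgd_minimize([1.0, 2.0, 3.0, 6.0]))
```

<p>For this particular objective the exact minimizer is the mean of the
samples, which gives a simple way to check that the updates converge.</p>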
@@ -348,7 +360,7 @@ So our matrix factorization target could
</pre></div>
-<h2 id="svd">SVD++</h2>
+<h2 id="svd">SVD++<a class="headerlink" href="#svd" title="Permanent
link">¶</a></h2>
<p>SVD++ is an enhancement of the SGD matrix factorization. </p>
<p>It can be viewed as a combination of a latent factor model and a
neighborhood-based model, considering not only how users rate, but also who has
rated what. </p>
<p>The complete model is a sum of 3 sub-models with complete prediction
formula as follows: </p>
@@ -393,13 +405,13 @@ please refer to the paper <a href="http:
<p>where alpha is the learning rate of gradient descent and N(u) is the set of
items for which user u has expressed a preference.</p>
-<h2 id="parallel-sgd">Parallel SGD</h2>
+<h2 id="parallel-sgd">Parallel SGD<a class="headerlink" href="#parallel-sgd"
title="Permanent link">¶</a></h2>
<p>Mahout has a parallel SGD implementation in the ParallelSGDFactorizer class. It
shuffles the user ratings in every iteration and
generates splits on the shuffled ratings. Each split is handled by a thread to
update the user features and item features using
vanilla SGD. </p>
<p>The implementation can be traced back to the lock-free version of SGD
described in the paper
<a href="http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf">Hogwild!:
A Lock-Free Approach to Parallelizing Stochastic Gradient Descent</a>.</p>
-<h2 id="alswr">ALSWR</h2>
+<h2 id="alswr">ALSWR<a class="headerlink" href="#alswr" title="Permanent
link">¶</a></h2>
<p>ALSWR is an iterative algorithm to solve the low rank factorization of user
feature matrix U and item feature matrix M.<br />
The loss function to be minimized is formulated as the sum of squared errors
plus <a href="http://en.wikipedia.org/wiki/Tikhonov_regularization">Tikhonov
regularization</a>:</p>
<div class="codehilite"><pre> <span class="n">L</span><span
class="p">(</span><span class="n">R</span><span class="p">,</span> <span
class="n">U</span><span class="p">,</span> <span class="n">M</span><span
class="p">)</span> <span class="p">=</span> <span class="n">sum</span><span
class="p">(</span><span class="n">pow</span><span class="p">((</span><span
class="n">R</span><span class="p">[</span><span class="n">u</span><span
class="p">,</span><span class="nb">i</span><span class="p">]</span> <span
class="o">-</span> <span class="n">U</span><span class="p">[</span><span
class="n">u</span><span class="p">,]</span><span class="o">*</span> <span
class="p">(</span><span class="n">M</span><span class="p">[</span><span
class="nb">i</span><span class="p">,]</span>^<span class="n">t</span><span
class="p">)),</span> 2<span class="p">))</span> <span class="o">+</span> <span
class="n">lambda</span> <span class="o">*</span> <span class="p">(</span><span
class="n">sum</span><span class="p">(</span><span
class="n">n</span><span class="p">(</span><span
class="n">u</span><span class="p">)</span> <span class="o">*</span> <span
class="o">||</span><span class="n">U</span><span class="p">[</span><span
class="n">u</span><span class="p">,]</span><span class="o">||</span>^2<span
class="p">)</span> <span class="o">+</span> <span class="n">sum</span><span
class="p">(</span><span class="n">n</span><span class="p">(</span><span
class="nb">i</span><span class="p">)</span> <span class="o">*</span> <span
class="o">||</span><span class="n">M</span><span class="p">[</span><span
class="nb">i</span><span class="p">,]</span><span class="o">||</span>^2<span
class="p">))</span>
@@ -424,7 +436,7 @@ item and their feature vectors:</p>
<p>The ALSWRFactorizer class is a non-distributed implementation of ALSWR
using multi-threading to dispatch the computation among several threads.
Mahout also offers a <a
href="https://mahout.apache.org/users/recommender/intro-als-hadoop.html">parallel
map-reduce implementation</a>.</p>
<p><a name="MatrixFactorization-Reference"></a></p>
-<h1 id="reference">Reference:</h1>
+<h1 id="reference">Reference:<a class="headerlink" href="#reference"
title="Permanent link">¶</a></h1>
<p><a
href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">Stochastic
gradient descent</a></p>
<p><a
href="http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf">ALSWR</a></p>
</div>
Modified:
websites/staging/mahout/trunk/content/users/recommender/quickstart.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/recommender/quickstart.html
(original)
+++ websites/staging/mahout/trunk/content/users/recommender/quickstart.html Fri
Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="recommender-overview">Recommender Overview</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="recommender-overview">Recommender Overview<a class="headerlink"
href="#recommender-overview" title="Permanent link">¶</a></h1>
<p>Recommenders have changed over the years. Mahout contains a long list of
them, which you can still use. But to get the best out of our more modern
approach we'll need to think of the Recommender as a "model creation"
component—supplied by Mahout's new spark-itemsimilarity job, and a
"serving" component—supplied by a modern scalable search engine, like
Solr.</p>
<p><img alt="image" src="http://i.imgur.com/fliHMBo.png" /></p>
<p>To integrate with your application you will collect user interactions,
storing them in a DB and also in a form usable by Mahout. The simplest way to
do this is to log user interactions to csv files (user-id, item-id). The DB
should be set up to contain the last n user interactions, which will form part
of the query for recommendations.</p>
<p>Mahout's spark-itemsimilarity will create a table of (item-id,
list-of-similar-items) in csv form. Think of this as an item collection with
one field containing the item-ids of similar items. Index this with your search
engine. </p>
<p>When your application needs recommendations for a specific person, get the
latest user history of interactions from the DB and query the indicator
collection with this history. You will get back an ordered list of item-ids.
These are your recommendations. You may wish to filter out any that the user
has already seen but that will depend on your use case.</p>
<p>All ids for users and items are preserved as string tokens and so work as
an external key in DBs or as doc ids for search engines; they also work as
tokens for search queries.</p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<ol>
<li>A free ebook, which talks about the general idea: <a
href="https://www.mapr.com/practical-machine-learning">Practical Machine
Learning</a></li>
<li>A slide deck, which talks about mixing actions or other indicators: <a
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">Creating
a Multimodal Recommender with Mahout and a Search Engine</a></li>
@@ -278,7 +290,7 @@
and <a
href="http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/">What's
New in Recommenders: part #2</a></li>
<li>A post describing the log-likelihood ratio: <a
href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">Surprise
and Coincidence</a>. LLR is used to reduce noise in the data while keeping the
calculations at O(n) complexity.</li>
</ol>
-<h2 id="mahout-model-creation">Mahout Model Creation</h2>
+<h2 id="mahout-model-creation">Mahout Model Creation<a class="headerlink"
href="#mahout-model-creation" title="Permanent link">¶</a></h2>
<p>See the page describing <a
href="http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html"><em>spark-itemsimilarity</em></a>
for more details.</p>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/recommender/recommender-documentation.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/recommender-documentation.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/recommender-documentation.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="RecommenderDocumentation-Overview"></a></p>
-<h2 id="overview">Overview</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="RecommenderDocumentation-Overview"></a></p>
+<h2 id="overview">Overview<a class="headerlink" href="#overview"
title="Permanent link">¶</a></h2>
<p><em>This documentation concerns the non-distributed, non-Hadoop-based
recommender engine / collaborative filtering code inside Mahout. It was
formerly a separate project called "Taste" and has continued development
@@ -294,19 +306,19 @@ and flexibility.</p>
these interfaces. These are the pieces from which you will build your own
recommendation engine. That's it! </p>
<p><a name="RecommenderDocumentation-Architecture"></a></p>
-<h2 id="architecture">Architecture</h2>
+<h2 id="architecture">Architecture<a class="headerlink" href="#architecture"
title="Permanent link">¶</a></h2>
<p><img alt="doc" src="../../images/taste-architecture.png" /></p>
<p>This diagram shows the relationship between various Mahout components in a
user-based recommender. An item-based recommender system is similar except
that there are no Neighborhood algorithms involved.</p>
<p><a name="RecommenderDocumentation-Recommender"></a></p>
-<h3 id="recommender">Recommender</h3>
+<h3 id="recommender">Recommender<a class="headerlink" href="#recommender"
title="Permanent link">¶</a></h3>
<p>A Recommender is the core abstraction in Mahout. Given a DataModel, it can
produce recommendations. Applications will most likely use the
<strong>GenericUserBasedRecommender</strong> or
<strong>GenericItemBasedRecommender</strong>,
possibly decorated by <strong>CachingRecommender</strong>.</p>
<p><a name="RecommenderDocumentation-DataModel"></a></p>
-<h3 id="datamodel">DataModel</h3>
+<h3 id="datamodel">DataModel<a class="headerlink" href="#datamodel"
title="Permanent link">¶</a></h3>
<p>A <strong>DataModel</strong> is the interface to information about user
preferences. An
implementation might draw this data from any source, but a database is the
most likely source. Be sure to wrap this with a
<strong>ReloadFromJDBCDataModel</strong> to get good performance! Mahout
provides <strong>MySQLJDBCDataModel</strong>, for example, to access preference
data from a database via JDBC and MySQL. Another exists for PostgreSQL. Mahout
also provides a <strong>FileDataModel</strong>, which is fine for small
applications.</p>
@@ -324,22 +336,22 @@ users and pages in the context of recomm
is only a notion of an association, or none, between a user and pages that
have been visited.</p>
<p><a name="RecommenderDocumentation-UserSimilarity"></a></p>
-<h3 id="usersimilarity">UserSimilarity</h3>
+<h3 id="usersimilarity">UserSimilarity<a class="headerlink"
href="#usersimilarity" title="Permanent link">¶</a></h3>
<p>A <strong>UserSimilarity</strong> defines a notion of similarity between
two users. This is
a crucial part of a recommendation engine. These are attached to a
<strong>Neighborhood</strong> implementation. <strong>ItemSimilarity</strong>
is analogous, but finds
similarity between items.</p>
<p><a name="RecommenderDocumentation-UserNeighborhood"></a></p>
-<h3 id="userneighborhood">UserNeighborhood</h3>
+<h3 id="userneighborhood">UserNeighborhood<a class="headerlink"
href="#userneighborhood" title="Permanent link">¶</a></h3>
<p>In a user-based recommender, recommendations are produced by finding a
"neighborhood" of similar users near a given user. A
<strong>UserNeighborhood</strong>
defines a means of determining that neighborhood — for example,
nearest 10 users. Implementations typically need a
<strong>UserSimilarity</strong> to
operate.</p>
<p><a name="RecommenderDocumentation-Examples"></a></p>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h2>
<p><a name="RecommenderDocumentation-User-basedRecommender"></a></p>
-<h3 id="user-based-recommender">User-based Recommender</h3>
+<h3 id="user-based-recommender">User-based Recommender<a class="headerlink"
href="#user-based-recommender" title="Permanent link">¶</a></h3>
<p>User-based recommenders are the "original", conventional style of
recommender systems. They can produce good recommendations when tweaked
properly; they are not necessarily the fastest recommender systems and are
@@ -378,7 +390,7 @@ algorithm:</p>
</pre></div>
-<h2 id="item-based-recommender">Item-based Recommender</h2>
+<h2 id="item-based-recommender">Item-based Recommender<a class="headerlink"
href="#item-based-recommender" title="Permanent link">¶</a></h2>
<p>We could have created an item-based recommender instead. Item-based
recommenders base recommendation not on user similarity, but on item
similarity. In theory these are about the same approach to the problem,
@@ -416,14 +428,14 @@ application, you would feed a list of pr
<p><a name="RecommenderDocumentation-Integrationwithyourapplication"></a></p>
-<h2 id="integration-with-your-application">Integration with your
application</h2>
+<h2 id="integration-with-your-application">Integration with your application<a
class="headerlink" href="#integration-with-your-application" title="Permanent
link">¶</a></h2>
<p>You can create a Recommender, as shown above, wherever you like in your
Java application, and use it. This includes simple Java applications or GUI
applications, server applications, and J2EE web applications.</p>
<p><a name="RecommenderDocumentation-Performance"></a></p>
-<h2 id="performance">Performance</h2>
+<h2 id="performance">Performance<a class="headerlink" href="#performance"
title="Permanent link">¶</a></h2>
<p><a name="RecommenderDocumentation-RuntimePerformance"></a></p>
-<h3 id="runtime-performance">Runtime Performance</h3>
+<h3 id="runtime-performance">Runtime Performance<a class="headerlink"
href="#runtime-performance" title="Permanent link">¶</a></h3>
<p>The more data you give, the better. Though Mahout is designed for
performance, you will undoubtedly run into performance issues at some
point. For best results, consider using the following command-line flags to
@@ -454,7 +466,7 @@ code and third-party code you use doesn'
<li>When using <strong>JDBCDataModel</strong>, make sure you wrap it with the
<strong>ReloadFromJDBCDataModel</strong> to load data into memory!</li>
</ul>
<p><a
name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a></p>
-<h3 id="algorithm-performance-which-one-is-best">Algorithm Performance: Which
One Is Best?</h3>
+<h3 id="algorithm-performance-which-one-is-best">Algorithm Performance: Which
One Is Best?<a class="headerlink"
href="#algorithm-performance-which-one-is-best" title="Permanent
link">¶</a></h3>
<p>There is no right answer; it depends on your data, your application,
environment, and performance needs. Mahout provides the building blocks
from which you can construct the best Recommender for your application. The
@@ -481,7 +493,7 @@ not make sense. In this case, try a <em>
traditional information retrieval figures like precision and recall, which
are more meaningful.</p>
<p><a name="RecommenderDocumentation-UsefulLinks"></a></p>
-<h2 id="useful-links">Useful Links</h2>
+<h2 id="useful-links">Useful Links<a class="headerlink" href="#useful-links"
title="Permanent link">¶</a></h2>
<p>Here's a handful of research papers that I've read and found particularly
useful:</p>
<p>J.S. Breese, D. Heckerman and C. Kadie, "<a
href="http://research.microsoft.com/research/pubs/view.aspx?tr_id=166">Empirical
Analysis of Predictive Algorithms for Collaborative Filtering</a>
Modified:
websites/staging/mahout/trunk/content/users/recommender/recommender-first-timer-faq.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/recommender-first-timer-faq.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/recommender-first-timer-faq.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="recommender-first-timer-dos-and-donts">Recommender First Timer Dos
and Don'ts</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="recommender-first-timer-dos-and-donts">Recommender First Timer Dos and
Don'ts<a class="headerlink" href="#recommender-first-timer-dos-and-donts"
title="Permanent link">¶</a></h1>
<p>Many people with an interest in recommenders arrive at Mahout since they're
building a first recommender system. Some starting questions have been
asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
Modified:
websites/staging/mahout/trunk/content/users/recommender/userbased-5-minutes.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/userbased-5-minutes.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/userbased-5-minutes.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,10 +264,21 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="creating-a-user-based-recommender-in-5-minutes">Creating a
User-Based Recommender in 5 minutes</h1>
-<h2 id="prerequisites">Prerequisites</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="creating-a-user-based-recommender-in-5-minutes">Creating a User-Based
Recommender in 5 minutes<a class="headerlink"
href="#creating-a-user-based-recommender-in-5-minutes" title="Permanent
link">¶</a></h1>
+<h2 id="prerequisites">Prerequisites<a class="headerlink"
href="#prerequisites" title="Permanent link">¶</a></h2>
<p>Create a Java project in your favorite IDE and make sure Mahout is on the
classpath. The easiest way to accomplish this is by importing it via Maven as
described on the <a href="/users/basics/quickstart.html">Quickstart</a>
page.</p>
-<h2 id="dataset">Dataset</h2>
+<h2 id="dataset">Dataset<a class="headerlink" href="#dataset" title="Permanent
link">¶</a></h2>
<p>Mahout's recommenders expect interactions between users and items as input.
The easiest way to supply such data to Mahout is in the form of a text file,
where every line has the format <em>userID,itemID,value</em>. Here
<em>userID</em> and <em>itemID</em> refer to a particular user and a particular
item, and <em>value</em> denotes the strength of the interaction (e.g. the
rating given to a movie).</p>
<p>In this example, we'll use some made-up data for simplicity. Create a file
called "dataset.csv" and copy the following example interactions into the file.
</p>
<pre>
@@ -304,7 +316,7 @@
4,18,1.0
</pre>
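<p>To make the line format concrete, here is a minimal sketch in plain Java that
parses one <em>userID,itemID,value</em> line. It is for illustration only and
does not use Mahout; the class and method names (<em>InteractionParser</em>,
<em>parseLine</em>) are hypothetical, and treating the IDs as long values is an
assumption.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: shows how one line of
// dataset.csv maps onto (userID, itemID, value).
public class InteractionParser {

    /** One user-item interaction, mirroring the userID,itemID,value format. */
    public static final class Interaction {
        public final long userId;
        public final long itemId;
        public final double value;

        public Interaction(long userId, long itemId, double value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses a single line such as "4,18,1.0". */
    public static Interaction parseLine(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected userID,itemID,value but got: " + line);
        }
        return new Interaction(
            Long.parseLong(parts[0].trim()),
            Long.parseLong(parts[1].trim()),
            Double.parseDouble(parts[2].trim()));
    }

    /** Parses many lines, skipping blank ones. */
    public static List<Interaction> parseAll(List<String> lines) {
        List<Interaction> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Interaction i = parseLine("4,18,1.0");
        System.out.println(i.userId + " " + i.itemId + " " + i.value);
    }
}
```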
-<h2 id="creating-a-user-based-recommender">Creating a user-based
recommender</h2>
+<h2 id="creating-a-user-based-recommender">Creating a user-based recommender<a
class="headerlink" href="#creating-a-user-based-recommender" title="Permanent
link">¶</a></h2>
<p>Create a class called <em>SampleRecommender</em> with a main method.</p>
<p>The first thing we have to do is load the data from the file. Mahout's
recommenders use an interface called <em>DataModel</em> to handle interaction
data. You can load our made-up interactions like this:</p>
<pre>
@@ -333,7 +345,7 @@ for (RecommendedItem recommendation : re
</pre>
<p>Congratulations, you have built your first recommender!</p>
-<h2 id="evaluation">Evaluation</h2>
+<h2 id="evaluation">Evaluation<a class="headerlink" href="#evaluation"
title="Permanent link">¶</a></h2>
<p>You might ask yourself how to make sure that your recommender returns good
results. Unfortunately, the only way to be really sure about the quality is to
run an A/B test with real users in a live system.</p>
<p>We can, however, get a feel for the quality through statistical offline
evaluation. Just keep in mind that this does not replace a test with real
users!</p>
<p>One way to check whether the recommender returns good results is to do a
<strong>hold-out</strong> test. We partition our dataset into two sets: a
training set consisting of 90% of the data and a test set consisting of 10%.
Then we train our recommender on the training set and see how well it predicts
the unknown interactions in the test set.</p>
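<p>The hold-out idea itself can be sketched in a few lines of plain Java. This
is only an illustration under stated assumptions: it does not use Mahout's API,
the class name <em>HoldOutSketch</em> is hypothetical, the ratings are
synthetic, and a trivial "predict the training mean" model stands in for a real
recommender. The error measure shown is the mean absolute error.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a 90/10 hold-out test; not Mahout code.
public class HoldOutSketch {

    /** Shuffles with a fixed seed and returns [trainingSet, testSet]. */
    public static List<List<Double>> split(List<Double> ratings,
                                           double trainingFraction,
                                           long seed) {
        List<Double> shuffled = new ArrayList<>(ratings);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<Double>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Stand-in model: always predicts the mean of the training ratings. */
    public static double trainMeanPredictor(List<Double> training) {
        if (training.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double r : training) {
            sum += r;
        }
        return sum / training.size();
    }

    /** Mean absolute error of a constant prediction on held-out ratings. */
    public static double meanAbsoluteError(double prediction, List<Double> test) {
        if (test.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (double actual : test) {
            total += Math.abs(actual - prediction);
        }
        return total / test.size();
    }

    public static void main(String[] args) {
        // Synthetic ratings between 1.0 and 5.0 (made up for this sketch).
        List<Double> ratings = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            ratings.add(1.0 + (i % 5));
        }
        List<List<Double>> parts = split(ratings, 0.9, 42L);
        double prediction = trainMeanPredictor(parts.get(0));
        double mae = meanAbsoluteError(prediction, parts.get(1));
        System.out.println("training size: " + parts.get(0).size());
        System.out.println("test size: " + parts.get(1).size());
        System.out.println("mean absolute error: " + mae);
    }
}
```

<p>A lower error on the held-out 10% suggests better predictions, but as noted
above it is no substitute for a test with real users.</p>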
Modified: websites/staging/mahout/trunk/content/users/sparkbindings/faq.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/sparkbindings/faq.html
(original)
+++ websites/staging/mahout/trunk/content/users/sparkbindings/faq.html Fri Apr
8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="faq-for-using-mahout-with-spark">FAQ for using Mahout with
Spark</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="faq-for-using-mahout-with-spark">FAQ for using Mahout with Spark<a
class="headerlink" href="#faq-for-using-mahout-with-spark" title="Permanent
link">¶</a></h1>
<p><strong>Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or
various classpath problems.</strong></p>
<p><strong>A:</strong> As of this writing, all reported
problems with starting the Spark shell in Mahout have revolved
around classpath issues in one way or another. </p>
Modified: websites/staging/mahout/trunk/content/users/sparkbindings/home.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/sparkbindings/home.html
(original)
+++ websites/staging/mahout/trunk/content/users/sparkbindings/home.html Fri Apr
8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified:
websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html
(original)
+++
websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,12 +264,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="playing-with-mahouts-spark-shell">Playing with Mahout's Spark
Shell</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="playing-with-mahouts-spark-shell">Playing with Mahout's Spark Shell<a
class="headerlink" href="#playing-with-mahouts-spark-shell" title="Permanent
link">¶</a></h1>
<p>This tutorial will show you how to play with Mahout's Scala DSL for linear
algebra and its Spark shell. <strong>Please keep in mind that this code is
still in a very early experimental stage</strong>.</p>
<p><em>(Edited for 0.10.2)</em></p>
-<h2 id="intro">Intro</h2>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h2>
<p>We'll use an excerpt of a publicly available <a
href="http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html">dataset about
cereals</a>. The dataset lists the protein, fat, carbohydrate and sugars (in
milligrams) contained in a set of cereals, as well as a customer rating for
each cereal. Our aim for this example is to fit a linear model that infers the
customer rating from the ingredients.</p>
-<table>
+<table class="table">
<thead>
<tr>
<th align="left">Name</th>
@@ -354,7 +366,7 @@
</tr>
</tbody>
</table>
-<h2 id="installing-mahout-spark-on-your-local-machine">Installing Mahout &
Spark on your local machine</h2>
+<h2 id="installing-mahout-spark-on-your-local-machine">Installing Mahout &
Spark on your local machine<a class="headerlink"
href="#installing-mahout-spark-on-your-local-machine" title="Permanent
link">¶</a></h2>
<p>We describe how to do a quick toy setup of Spark & Mahout on your local
machine, so that you can run this example and play with the shell. </p>
<ol>
<li>Download <a
href="http://www.apache.org/dyn/closer.cgi/spark/spark-1.1.1/spark-1.1.1.tgz">Apache
Spark 1.1.1</a> and unpack the archive file</li>
@@ -362,7 +374,7 @@
<li>Create a directory for Mahout somewhere on your machine, change into it,
and check out the master branch of Apache Mahout from GitHub with <code>git clone
https://github.com/apache/mahout mahout</code></li>
<li>Change to the <code>mahout</code> directory and build Mahout using
<code>mvn -DskipTests clean install</code></li>
</ol>
-<h2 id="starting-mahouts-spark-shell">Starting Mahout's Spark shell</h2>
+<h2 id="starting-mahouts-spark-shell">Starting Mahout's Spark shell<a
class="headerlink" href="#starting-mahouts-spark-shell" title="Permanent
link">¶</a></h2>
<ol>
<li>Go to the directory where you unpacked Spark and type
<code>sbin/start-all.sh</code> to start Spark locally</li>
<li>Open a browser and point it to <a
href="http://localhost:8080/">http://localhost:8080/</a> to check whether Spark
started successfully. Copy the URL of the Spark master at the top of the page
(it starts with <strong>spark://</strong>)</li>
@@ -374,7 +386,7 @@ export MASTER=[url of the Spark master]
you should see the shell starting and get the prompt <code>mahout></code>.
Check the
<a href="http://mahout.apache.org/users/sparkbindings/faq.html">FAQ</a> for
further troubleshooting.</li>
</ol>
-<h2 id="implementation">Implementation</h2>
+<h2 id="implementation">Implementation<a class="headerlink"
href="#implementation" title="Permanent link">¶</a></h2>
<p>We'll use the shell to interactively play with the data and incrementally
implement a simple <a
href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a>
algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout
unless we processed a large dataset stored in a distributed filesystem. But for
the sake of this example, we'll use our tiny toy dataset and "pretend" it was
too big to fit onto a single machine.</p>
<p><em>Note: You can incrementally follow the example by copy-and-pasting the
code into your running Mahout shell.</em></p>
<p>Mahout's linear algebra DSL has an abstraction called
<em>DistributedRowMatrix (DRM)</em> which models a matrix that is partitioned
by rows and stored in the memory of a cluster of machines. We use
<code>dense()</code> to create a dense in-memory matrix from our toy dataset
and use <code>drmParallelize</code> to load it into the cluster, "mimicking" a
large, partitioned dataset.</p>