Modified:
websites/staging/mahout/trunk/content/users/classification/support-vector-machines.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/support-vector-machines.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/support-vector-machines.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="SupportVectorMachines-SupportVectorMachines"></a></p>
-<h1 id="support-vector-machines">Support Vector Machines</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="SupportVectorMachines-SupportVectorMachines"></a></p>
+<h1 id="support-vector-machines">Support Vector Machines<a class="headerlink"
href="#support-vector-machines" title="Permanent link">¶</a></h1>
<p>As with Naive Bayes, Support Vector Machines (or SVMs for short) can be used
to solve the task of assigning objects to classes. However, the way this
task is solved is completely different from the approach taken in Naive Bayes.</p>
@@ -291,9 +303,9 @@ solutions. Each separating hyperplane ne
training examples. In addition, that way, the solution may be based on the
information encoded in only very few examples.</p>
<p><a name="SupportVectorMachines-Strategyforparallelization"></a></p>
-<h2 id="strategy-for-parallelization">Strategy for parallelization</h2>
+<h2 id="strategy-for-parallelization">Strategy for parallelization<a
class="headerlink" href="#strategy-for-parallelization" title="Permanent
link">¶</a></h2>
<p><a name="SupportVectorMachines-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink"
href="#design-of-packages" title="Permanent link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/twenty-newsgroups.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,10 +264,21 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a></p>
-<h2 id="twenty-newsgroups-classification-example">Twenty Newsgroups
Classification Example</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a></p>
+<h2 id="twenty-newsgroups-classification-example">Twenty Newsgroups
Classification Example<a class="headerlink"
href="#twenty-newsgroups-classification-example" title="Permanent
link">¶</a></h2>
<p><a name="TwentyNewsgroups-Introduction"></a></p>
-<h2 id="introduction">Introduction</h2>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<p>The 20 newsgroups dataset is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. The 20 newsgroups collection has become a popular data set for
@@ -275,7 +287,7 @@ text classification and text clustering.
classifier to create a model that would classify a new document into one of
the 20 newsgroups.</p>
<p><a name="TwentyNewsgroups-Prerequisites"></a></p>
-<h3 id="prerequisites">Prerequisites</h3>
+<h3 id="prerequisites">Prerequisites<a class="headerlink"
href="#prerequisites" title="Permanent link">¶</a></h3>
<ul>
<li>Mahout has been downloaded (<a
href="https://mahout.apache.org/general/downloads.html">instructions
here</a>)</li>
<li>Maven is available</li>
@@ -286,7 +298,7 @@ the 20 newsgroups.</p>
</li>
</ul>
<p><a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a></p>
-<h3 id="instructions-for-running-the-example">Instructions for running the
example</h3>
+<h3 id="instructions-for-running-the-example">Instructions for running the
example<a class="headerlink" href="#instructions-for-running-the-example"
title="Permanent link">¶</a></h3>
<ol>
<li>
<p>If running Hadoop in cluster mode, start the Hadoop daemons by executing
the following commands:</p>
@@ -372,7 +384,7 @@ Reliability <span class="p">(</span>stan
<p><a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a></p>
-<h2 id="end-to-end-commands-to-build-a-cbayes-model-for-20-newsgroups">End to
end commands to build a CBayes model for 20 newsgroups</h2>
+<h2 id="end-to-end-commands-to-build-a-cbayes-model-for-20-newsgroups">End to
end commands to build a CBayes model for 20 newsgroups<a class="headerlink"
href="#end-to-end-commands-to-build-a-cbayes-model-for-20-newsgroups"
title="Permanent link">¶</a></h2>
<p>The <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">20
newsgroups example script</a> issues the following commands as outlined above.
We can build a CBayes classifier from the command line by following the process
in the script: </p>
<p><em>Be sure that <strong>MAHOUT_HOME</strong>/bin and
<strong>HADOOP_HOME</strong>/bin are in your <strong>$PATH</strong></em></p>
<ol>
@@ -396,9 +408,7 @@ Reliability <span class="p">(</span>stan
<ul>
-<li>
-<p>If you're running on a Hadoop cluster:</p>
-<div class="codehilite"><pre>$ hadoop dfs -put <span class="cp">${</span><span
class="n">WORK_DIR</span><span class="cp">}</span>/20news-all <span
class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-all
+<li>If you're running on a Hadoop cluster:<div class="codehilite"><pre>$
hadoop dfs -put <span class="cp">${</span><span class="n">WORK_DIR</span><span
class="cp">}</span>/20news-all <span class="cp">${</span><span
class="n">WORK_DIR</span><span class="cp">}</span>/20news-all
</pre></div>
Modified:
websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,11 +264,22 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="wikipedia-xml-parser-and-naive-bayes-classifier-example">Wikipedia
XML parser and Naive Bayes Classifier Example</h1>
-<h2 id="introduction">Introduction</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="wikipedia-xml-parser-and-naive-bayes-classifier-example">Wikipedia XML
parser and Naive Bayes Classifier Example<a class="headerlink"
href="#wikipedia-xml-parser-and-naive-bayes-classifier-example"
title="Permanent link">¶</a></h1>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<p>Mahout has an <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">example
script</a> [1] which will download a recent XML dump of the (entire if
desired) <a href="http://dumps.wikimedia.org/enwiki/latest/">English Wikipedia
database</a>. After running the classification script, you can use the <a
href="https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala">document
classification script</a> from the Mahout <a
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">spark-shell</a>
to vectorize and classify text from outside of the training and testing corpus
using a model built on the Wikipedia dataset. </p>
<p>You can run this script to build and test a Naive Bayes classifier for
option (1) 10 arbitrary countries or option (2) 2 countries (United States and
United Kingdom).</p>
-<h2 id="oververview">Overview</h2>
+<h2 id="oververview">Overview<a class="headerlink" href="#oververview"
title="Permanent link">¶</a></h2>
<p>To run the example, simply execute the
<code>$MAHOUT_HOME/examples/bin/classify-wikipedia.sh</code> script.</p>
<p>By default the script is set to run on a medium-sized Wikipedia XML dump.
To run on the full set (the entire English Wikipedia) you can change the
download by commenting out line 78, and uncommenting line 80 of <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a>
[1]. However, this is not recommended unless you have the resources to do so.
<em>Be sure to clean your work directory when changing datasets: option
(3).</em></p>
<p>The step-by-step process for creating a Naive Bayes classifier for the
Wikipedia XML dump is very similar to that for <a
href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">creating
a 20 Newsgroups Classifier</a> [4]. The only difference is that instead of
running <code>$mahout seqdirectory</code> on the unzipped 20 Newsgroups file,
you'll run <code>$mahout seqwiki</code> on the unzipped Wikipedia XML dump.</p>
@@ -290,7 +302,7 @@ directory: country.txt, country10.txt a
<p>After <code>seqwiki</code>, the script runs <code>seq2sparse</code>,
<code>split</code>, <code>trainnb</code> and <code>testnb</code> as in the <a
href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">step
by step 20newsgroups example</a>. When all of the jobs have finished, a
confusion matrix will be displayed.</p>
-<h1 id="resourcese">Resources</h1>
+<h1 id="resourcese">Resources<a class="headerlink" href="#resourcese"
title="Permanent link">¶</a></h1>
<p>[1] <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a></p>
<p>[2] <a
href="https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala">Document
classification script for the Mahout Spark Shell</a></p>
<p>[3] <a
href="https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt">Example
category file</a></p>
Modified:
websites/staging/mahout/trunk/content/users/clustering/20newsgroups.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/20newsgroups.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/20newsgroups.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a></p>
-<h1 id="naive-bayes-using-20-newsgroups-data">Naive Bayes using 20 Newsgroups
Data</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a></p>
+<h1 id="naive-bayes-using-20-newsgroups-data">Naive Bayes using 20 Newsgroups
Data<a class="headerlink" href="#naive-bayes-using-20-newsgroups-data"
title="Permanent link">¶</a></h1>
<p>See <a
href="https://issues.apache.org/jira/browse/MAHOUT-9">https://issues.apache.org/jira/browse/MAHOUT-9</a></p>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/canopy-clustering.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/canopy-clustering.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/canopy-clustering.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="CanopyClustering-CanopyClustering"></a></p>
-<h1 id="canopy-clustering">Canopy Clustering</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="CanopyClustering-CanopyClustering"></a></p>
+<h1 id="canopy-clustering">Canopy Clustering<a class="headerlink"
href="#canopy-clustering" title="Permanent link">¶</a></h1>
<p><a href="http://www.kamalnigam.com/papers/canopy-kdd00.pdf">Canopy
Clustering</a>
is a very simple, fast and surprisingly accurate method for grouping
objects into clusters. All objects are represented as a point in a
@@ -285,7 +297,7 @@ distance measurements can be significant
outside of the initial canopies.</p>
<p><strong>WARNING</strong>: Canopy is deprecated in the latest release and
will be removed once streaming k-means becomes stable enough.</p>
<p><a name="CanopyClustering-Strategyforparallelization"></a></p>
-<h2 id="strategy-for-parallelization">Strategy for parallelization</h2>
+<h2 id="strategy-for-parallelization">Strategy for parallelization<a
class="headerlink" href="#strategy-for-parallelization" title="Permanent
link">¶</a></h2>
<p>Looking at the sample Hadoop implementation in <a
href="http://code.google.com/p/canopy-clustering/">http://code.google.com/p/canopy-clustering/</a>
the processing is done in 3 M/R steps:
1. The data is massaged into suitable input format
@@ -299,13 +311,13 @@ centers
. Finally here is the <a
href="http://en.wikipedia.org/wiki/Canopy_clustering_algorithm">Wikipedia
page</a>
.</p>
<p><a name="CanopyClustering-Designofimplementation"></a></p>
-<h2 id="design-of-implementation">Design of implementation</h2>
+<h2 id="design-of-implementation">Design of implementation<a
class="headerlink" href="#design-of-implementation" title="Permanent
link">¶</a></h2>
<p>The implementation accepts as input Hadoop SequenceFiles containing
multidimensional points (VectorWritable). Points may be expressed either as
dense or sparse Vectors and processing is done in two phases: Canopy
generation and, optionally, Clustering.</p>
<p><a name="CanopyClustering-Canopygenerationphase"></a></p>
-<h3 id="canopy-generation-phase">Canopy generation phase</h3>
+<h3 id="canopy-generation-phase">Canopy generation phase<a class="headerlink"
href="#canopy-generation-phase" title="Permanent link">¶</a></h3>
<p>During the map step, each mapper processes a subset of the total points and
applies the chosen distance measure and thresholds to generate canopies. In
the mapper, each point which is found to be within an existing canopy will
@@ -318,7 +330,7 @@ final set of canopy centroids which is o
centroids). The reducer output format is: SequenceFile(Text, Canopy) with
the <em>key</em> encoding the canopy identifier. </p>
<p><a name="CanopyClustering-Clusteringphase"></a></p>
-<h3 id="clustering-phase">Clustering phase</h3>
+<h3 id="clustering-phase">Clustering phase<a class="headerlink"
href="#clustering-phase" title="Permanent link">¶</a></h3>
<p>During the clustering phase, each mapper reads the Canopies produced by the
first phase. Since all mappers have the same canopy definitions, their
outputs will be combined during the shuffle so that each reducer (many are
@@ -329,7 +341,7 @@ WeightedVectorWritable has two fields: a
vector. Together they encode the probability that each vector is a member
of the given canopy.</p>
<p><a name="CanopyClustering-RunningCanopyClustering"></a></p>
-<h2 id="running-canopy-clustering">Running Canopy Clustering</h2>
+<h2 id="running-canopy-clustering">Running Canopy Clustering<a
class="headerlink" href="#running-canopy-clustering" title="Permanent
link">¶</a></h2>
<p>The canopy clustering algorithm may be run using a command-line invocation
on CanopyDriver.main or by making a Java call to CanopyDriver.run(...).
Both require several arguments:</p>
@@ -390,7 +402,7 @@ clustering, the weights are computed as
is between the cluster center and the vector using the chosen
DistanceMeasure.</p>
<p><a name="CanopyClustering-Examples"></a></p>
-<h1 id="examples">Examples</h1>
+<h1 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h1>
<p>The following images illustrate Canopy clustering applied to a set of
randomly-generated 2-d data points. The points are generated using a normal
distribution centered at a mean location and with a constant standard
Modified:
websites/staging/mahout/trunk/content/users/clustering/canopy-commandline.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/canopy-commandline.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/canopy-commandline.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a></p>
-<h1 id="running-canopy-clustering-from-the-command-line">Running Canopy
Clustering from the Command Line</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a
name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a></p>
+<h1 id="running-canopy-clustering-from-the-command-line">Running Canopy
Clustering from the Command Line<a class="headerlink"
href="#running-canopy-clustering-from-the-command-line" title="Permanent
link">¶</a></h1>
<p>Mahout's Canopy clustering can be launched from the same command line
invocation whether you are running on a single machine in stand-alone mode
or on a larger Hadoop cluster. The difference is determined by the
@@ -283,7 +295,7 @@ the Mahout version number. For example,
job will be mahout-core-0.3.job</li>
</ul>
<p><a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a></p>
-<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster</h2>
+<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster<a class="headerlink"
href="#testing-it-on-one-single-machine-wo-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>Put the data: cp <PATH TO DATA> testdata</li>
<li>
@@ -293,7 +305,7 @@ org.apache.mahout.common.distance.Cosine
</li>
</ul>
<p><a name="canopy-commandline-Runningitonthecluster"></a></p>
-<h2 id="running-it-on-the-cluster">Running it on the cluster</h2>
+<h2 id="running-it-on-the-cluster">Running it on the cluster<a
class="headerlink" href="#running-it-on-the-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li>
<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata</li>
@@ -310,7 +322,7 @@ to view all outputs.</p>
</li>
</ul>
<p><a name="canopy-commandline-Commandlineoptions"></a></p>
-<h1 id="command-line-options">Command line options</h1>
+<h1 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h1>
<div class="codehilite"><pre> <span class="o">--</span><span
class="n">input</span> <span class="p">(</span><span class="o">-</span><span
class="nb">i</span><span class="p">)</span> <span class="n">input</span>
<span class="n">Path</span> <span class="n">to</span> <span
class="n">job</span> <span class="n">input</span> <span
class="n">directory</span><span class="p">.</span><span class="n">Must</span>
<span class="n">be</span> <span class="n">a</span>
<span class="n">SequenceFile</span> <span class="n">of</span>
<span class="n">VectorWritable</span>
Modified:
websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="ClusterDumper-Introduction"></a></p>
-<h2 id="cluster-dumper-introduction">Cluster Dumper - Introduction</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="ClusterDumper-Introduction"></a></p>
+<h2 id="cluster-dumper-introduction">Cluster Dumper - Introduction<a
class="headerlink" href="#cluster-dumper-introduction" title="Permanent
link">¶</a></h2>
<p>Clustering tasks in Mahout will output data in the format of a SequenceFile
(Text, Cluster) and the Text is a cluster identifier string. To analyze
this output we need to convert the sequence files to a human readable
format and this is achieved using the clusterdump utility.</p>
<p><a
name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a></p>
-<h2 id="steps-for-analyzing-cluster-output-using-clusterdump-utility">Steps
for analyzing cluster output using clusterdump utility</h2>
+<h2 id="steps-for-analyzing-cluster-output-using-clusterdump-utility">Steps
for analyzing cluster output using clusterdump utility<a class="headerlink"
href="#steps-for-analyzing-cluster-output-using-clusterdump-utility"
title="Permanent link">¶</a></h2>
<p>After you've executed a clustering task (either an example or a real-world job),
you can run clusterdumper in 2 modes:</p>
<ol>
@@ -278,7 +290,7 @@ you can run clusterdumper in 2 modes:</p
<li>Standalone Java Program </li>
</ol>
<p><a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a></p>
-<h3 id="hadoop-environment">Hadoop Environment</h3>
+<h3 id="hadoop-environment">Hadoop Environment<a class="headerlink"
href="#hadoop-environment" title="Permanent link">¶</a></h3>
<p>If you have set up your HADOOP_HOME environment variable, you can use the
command line utility <code>mahout</code> to execute the ClusterDumper on
Hadoop. In
this case we won't need to get the output clusters to our local machines.
@@ -286,7 +298,7 @@ The utility will read the output cluster
human-readable cluster values into our local file system. Say you've just
executed the <a href="clustering-of-synthetic-control-data.html">synthetic
control example </a>
and want to analyze the output, you can execute the <code>mahout
clusterdumper</code> utility from the command line.</p>
-<h4 id="cli-options">CLI options:</h4>
+<h4 id="cli-options">CLI options:<a class="headerlink" href="#cli-options"
title="Permanent link">¶</a></h4>
<div class="codehilite"><pre><span class="o">--</span><span
class="n">help</span> <span
class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
<span class="o">--</span><span class="n">input</span> <span
class="p">(</span><span class="o">-</span><span class="nb">i</span><span
class="p">)</span> <span class="n">input</span> <span
class="n">The</span> <span class="n">directory</span> <span
class="n">containing</span> <span class="n">Sequence</span>
<span class="n">Files</span> <span
class="k">for</span> <span class="n">the</span> <span class="n">Clusters</span>
@@ -316,7 +328,7 @@ executed the <a href="clustering-of-synt
</pre></div>
-<h3 id="standalone-java-program">Standalone Java Program</h3>
+<h3 id="standalone-java-program">Standalone Java Program<a class="headerlink"
href="#standalone-java-program" title="Permanent link">¶</a></h3>
<p>Run the clusterdump utility as follows as a standalone Java Program through
Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as
specified in <a href="../../developers/buildingmahout.html">Working with Maven
in Eclipse</a>. -->
To execute ClusterDumper.java,</p>
<ul>
Modified:
websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,12 +264,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="clustering-synthetic-control-data">Clustering synthetic control
data</h1>
-<h2 id="introduction">Introduction</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="clustering-synthetic-control-data">Clustering synthetic control data<a
class="headerlink" href="#clustering-synthetic-control-data" title="Permanent
link">¶</a></h1>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<p>This example will demonstrate clustering of time series data, specifically
control charts. <a href="http://en.wikipedia.org/wiki/Control_chart">Control
charts</a> are tools used to determine whether a manufacturing or business
process is in a state of statistical control. Such control charts are generated
/ simulated repeatedly at equal time intervals. A <a
href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html">simulated
dataset</a> is available in the UCI Machine Learning Repository.</p>
<p>The control chart time series need to be clustered into close-knit
groups. The data set we use is synthetic and is meant to resemble real-world
information in an anonymized format. It contains six different classes: Normal,
Cyclic, Increasing trend, Decreasing trend, Upward shift, and Downward shift. In
this example we will use Mahout to cluster the data into the corresponding class
buckets.</p>
<p><em>For the sake of simplicity, we won't use a cluster in this example, but
instead show you the commands to run the clustering examples locally with
Hadoop</em>.</p>
-<h2 id="setup">Setup</h2>
+<h2 id="setup">Setup<a class="headerlink" href="#setup" title="Permanent
link">¶</a></h2>
<p>We need to do some initial setup before we are able to run the example. </p>
<ol>
<li>
@@ -287,7 +299,7 @@
<p>Create a folder called <em>testdata</em> in the current directory and copy
the dataset into this folder.</p>
</li>
</ol>
-<h2 id="clustering-examples">Clustering Examples</h2>
+<h2 id="clustering-examples">Clustering Examples<a class="headerlink"
href="#clustering-examples" title="Permanent link">¶</a></h2>
<p>Depending on the clustering algorithm you want to run, the following
commands can be used:</p>
<ul>
<li>
Modified:
websites/staging/mahout/trunk/content/users/clustering/clustering-seinfeld-episodes.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/clustering-seinfeld-episodes.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/clustering-seinfeld-episodes.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>Below is short tutorial on how to cluster Seinfeld episode transcripts
with
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>Below is a short tutorial on how to cluster Seinfeld episode transcripts with
Mahout.</p>
<p>http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/</p>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/clusteringyourdata.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/clusteringyourdata.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/clusteringyourdata.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,16 +264,27 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="clustering-your-data">Clustering your data</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="clustering-your-data">Clustering your data<a class="headerlink"
href="#clustering-your-data" title="Permanent link">¶</a></h1>
<p>After you've done the <a href="quickstart.html">Quickstart</a> and are
familiar with the basics of Mahout, it is time to cluster your own
data. See also <a href="http://en.wikipedia.org/wiki/Cluster_analysis">Wikipedia on
cluster analysis</a> for more background.</p>
<p>The following pieces <em>may</em> be useful in getting started:</p>
<p><a name="ClusteringYourData-Input"></a></p>
-<h1 id="input">Input</h1>
+<h1 id="input">Input<a class="headerlink" href="#input" title="Permanent
link">¶</a></h1>
<p>For starters, you will need your data in an appropriate Vector format, see
<a href="../basics/creating-vectors.html">Creating Vectors</a>.
In particular for text preparation check out <a
href="../basics/creating-vectors-from-text.html">Creating Vectors from
Text</a>.</p>
<p><a name="ClusteringYourData-RunningtheProcess"></a></p>
-<h1 id="running-the-process">Running the Process</h1>
+<h1 id="running-the-process">Running the Process<a class="headerlink"
href="#running-the-process" title="Permanent link">¶</a></h1>
<ul>
<li>
<p><a href="canopy-clustering.html">Canopy background</a> and <a
href="canopy-commandline.html">canopy-commandline</a>.</p>
@@ -295,14 +307,14 @@ In particular for text preparation check
</li>
</ul>
<p><a name="ClusteringYourData-RetrievingtheOutput"></a></p>
-<h1 id="retrieving-the-output">Retrieving the Output</h1>
+<h1 id="retrieving-the-output">Retrieving the Output<a class="headerlink"
href="#retrieving-the-output" title="Permanent link">¶</a></h1>
<p>Mahout has a cluster dumper utility that can be used to retrieve and
evaluate your clustering data.</p>
<div class="codehilite"><pre><span class="o">./</span><span
class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span
class="n">clusterdump</span> <span class="o">&lt;</span><span
class="n">OPTIONS</span><span class="o">&gt;</span>
</pre></div>
<p><a name="ClusteringYourData-Theclusterdumperoptionsare:"></a></p>
-<h2 id="the-cluster-dumper-options-are">The cluster dumper options are:</h2>
+<h2 id="the-cluster-dumper-options-are">The cluster dumper options are:<a
class="headerlink" href="#the-cluster-dumper-options-are" title="Permanent
link">¶</a></h2>
<div class="codehilite"><pre> <span class="o">--</span><span
class="n">help</span> <span class="p">(</span><span class="o">-</span><span
class="n">h</span><span class="p">)</span> <span
class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
<span class="o">--</span><span class="n">input</span> <span
class="p">(</span><span class="o">-</span><span class="nb">i</span><span
class="p">)</span> <span class="n">input</span> <span
class="n">The</span> <span class="n">directory</span> <span
class="n">containing</span> <span class="n">Sequence</span>
@@ -346,7 +358,7 @@ In particular for text preparation check
<p>More information on using the clusterdump utility can be found <a
href="cluster-dumper.html">here</a>.</p>
<p><a name="ClusteringYourData-ValidatingtheOutput"></a></p>
-<h1 id="validating-the-output">Validating the Output</h1>
+<h1 id="validating-the-output">Validating the Output<a class="headerlink"
href="#validating-the-output" title="Permanent link">¶</a></h1>
<p>
Ted Dunning: A principled approach to cluster evaluation is to measure how
well the
cluster membership captures the structure of unseen data. A natural
@@ -369,12 +381,11 @@ data.</p>
<p>For text, you can actually compute perplexity which measures how well
cluster membership predicts what words are used. This is nice because you
don't have to worry about the entropy of real valued numbers.</p>
-<p>Manual inspection and the so-called laugh test is also important. The idea
+<p quote="quote">Manual inspection and the so-called laugh test is also
important. The idea
is that the results should not be so ludicrous as to make you laugh.
Unfortunately, it is pretty easy to kid yourself into thinking your system
is working using this kind of inspection. The problem is that we are too
-good at seeing (making up) patterns.
-{quote}</p>
+good at seeing (making up) patterns.</p>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/expectation-maximization.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/expectation-maximization.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/expectation-maximization.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="ExpectationMaximization-ExpectationMaximization"></a></p>
-<h1 id="expectation-maximization">Expectation Maximization</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="ExpectationMaximization-ExpectationMaximization"></a></p>
+<h1 id="expectation-maximization">Expectation Maximization<a
class="headerlink" href="#expectation-maximization" title="Permanent
link">¶</a></h1>
<p>The principle of EM can be applied to several learning settings, but is
most commonly associated with clustering. The main principle of the
algorithm is comparable to k-Means. Yet in contrast to hard cluster
@@ -272,7 +284,7 @@ assignments, each object is given some p
Accordingly, cluster centers are recomputed based on the average of all
objects weighted by their probability of belonging to the cluster at hand.</p>
<p><a name="ExpectationMaximization-Canopy-modifiedEM"></a></p>
-<h2 id="canopy-modified-em">Canopy-modified EM</h2>
+<h2 id="canopy-modified-em">Canopy-modified EM<a class="headerlink"
href="#canopy-modified-em" title="Permanent link">¶</a></h2>
<p>One can also use the canopies idea to speed up prototype-based clustering
methods like K-means and Expectation-Maximization (EM). In general, neither
K-means nor EM specifies how many clusters to use. The canopies technique does
@@ -306,9 +318,9 @@ iterative step (apart from the enormous
fewer terms) will be negligible since points outside the canopy will have
exponentially small influence.</p>
<p><a name="ExpectationMaximization-StrategyforParallelization"></a></p>
-<h2 id="strategy-for-parallelization">Strategy for Parallelization</h2>
+<h2 id="strategy-for-parallelization">Strategy for Parallelization<a
class="headerlink" href="#strategy-for-parallelization" title="Permanent
link">¶</a></h2>
<p><a name="ExpectationMaximization-Map/ReduceImplementation"></a></p>
-<h2 id="mapreduce-implementation">Map/Reduce Implementation</h2>
+<h2 id="mapreduce-implementation">Map/Reduce Implementation<a
class="headerlink" href="#mapreduce-implementation" title="Permanent
link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means-commandline.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means-commandline.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means-commandline.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a></p>
-<h1 id="running-fuzzy-k-means-clustering-from-the-command-line">Running Fuzzy
k-Means Clustering from the Command Line</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a
name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a></p>
+<h1 id="running-fuzzy-k-means-clustering-from-the-command-line">Running Fuzzy
k-Means Clustering from the Command Line<a class="headerlink"
href="#running-fuzzy-k-means-clustering-from-the-command-line" title="Permanent
link">¶</a></h1>
<p>Mahout's Fuzzy k-Means clustering can be launched from the same command
line invocation whether you are running on a single machine in stand-alone
mode or on a larger Hadoop cluster. The difference is determined by the
@@ -283,7 +295,7 @@ the Mahout version number. For example,
job will be mahout-core-0.3.job</li>
</ul>
<p><a
name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a></p>
-<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster</h2>
+<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster<a class="headerlink"
href="#testing-it-on-one-single-machine-wo-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>Put the data: cp &lt;PATH TO DATA&gt; testdata</li>
<li>
@@ -292,7 +304,7 @@ job will be mahout-core-0.3.job</li>
</li>
</ul>
<p><a name="fuzzy-k-means-commandline-Runningitonthecluster"></a></p>
-<h2 id="running-it-on-the-cluster">Running it on the cluster</h2>
+<h2 id="running-it-on-the-cluster">Running it on the cluster<a
class="headerlink" href="#running-it-on-the-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li>
<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put &lt;PATH TO DATA&gt; testdata</li>
@@ -308,7 +320,7 @@ to view all outputs.</p>
</li>
</ul>
<p><a name="fuzzy-k-means-commandline-Commandlineoptions"></a></p>
-<h1 id="command-line-options">Command line options</h1>
+<h1 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h1>
<div class="codehilite"><pre> <span class="o">--</span><span
class="n">input</span> <span class="p">(</span><span class="o">-</span><span
class="nb">i</span><span class="p">)</span> <span class="n">input</span>
<span class="n">Path</span> <span class="n">to</span> <span
class="n">job</span> <span class="n">input</span> <span
class="n">directory</span><span class="p">.</span>
<span class="n">Must</span> <span
class="n">be</span> <span class="n">a</span> <span
class="n">SequenceFile</span> <span class="n">of</span>
<span class="n">VectorWritable</span>
Modified:
websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/fuzzy-k-means.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="fuzzy-k-means">Fuzzy K-Means</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="fuzzy-k-means">Fuzzy K-Means<a class="headerlink"
href="#fuzzy-k-means" title="Permanent link">¶</a></h1>
<p>Fuzzy K-Means (also called Fuzzy C-Means) is an extension of <a
href="http://mahout.apache.org/users/clustering/k-means-clustering.html">K-Means</a>
, the popular simple clustering technique. While K-Means discovers hard
clusters (a point belongs to only one cluster), Fuzzy K-Means is a more
@@ -271,7 +283,7 @@ statistically formalized method and disc
particular point can belong to more than one cluster with a certain
probability.</p>
<p><a name="FuzzyK-Means-Algorithm"></a></p>
-<h4 id="algorithm">Algorithm</h4>
+<h4 id="algorithm">Algorithm<a class="headerlink" href="#algorithm"
title="Permanent link">¶</a></h4>
<p>Like K-Means, Fuzzy K-Means works on objects that can be represented
in an n-dimensional vector space for which a distance measure is defined.
The algorithm is similar to k-means.</p>
@@ -284,7 +296,7 @@ The algorithm is similar to k-means.</p>
</li>
</ul>
<p><a name="FuzzyK-Means-DesignImplementation"></a></p>
-<h4 id="design-implementation">Design Implementation</h4>
+<h4 id="design-implementation">Design Implementation<a class="headerlink"
href="#design-implementation" title="Permanent link">¶</a></h4>
<p>The design is similar to the K-Means implementation in Mahout. It accepts an input
file containing vector points. The user can either provide the cluster centers
as input or allow the canopy algorithm to run and create the initial clusters.</p>
@@ -320,7 +332,7 @@ identifier (e.g. "C14". Output value is:
"C14"). The reducer encodes unconverged clusters with a 'Cn' cluster Id and
converged clusters with 'Vn' clusterId.</p>
<p><a name="FuzzyK-Means-RunningFuzzyk-MeansClustering"></a></p>
-<h2 id="running-fuzzy-k-means-clustering">Running Fuzzy k-Means Clustering</h2>
+<h2 id="running-fuzzy-k-means-clustering">Running Fuzzy k-Means Clustering<a
class="headerlink" href="#running-fuzzy-k-means-clustering" title="Permanent
link">¶</a></h2>
<p>The Fuzzy k-Means clustering algorithm may be run using a command-line
invocation on FuzzyKMeansDriver.main or by making a Java call to
FuzzyKMeansDriver.run(). </p>
@@ -389,7 +401,7 @@ double <em>weight</em> and a VectorWrita
computed as 1/(1+distance) where the distance is between the cluster center
and the vector using the chosen DistanceMeasure. </p>
<p><a name="FuzzyK-Means-Examples"></a></p>
-<h1 id="examples">Examples</h1>
+<h1 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h1>
<p>The following images illustrate Fuzzy k-Means clustering applied to a set
of randomly-generated 2-d data points. The points are generated using a
normal distribution centered at a mean location and with a constant
@@ -416,7 +428,7 @@ data set which is generated using asymme
Fuzzy k-Means does a fair job handling this data set as well.</p>
<p><img alt="fuzzy" src="../../images/2dFuzzyKMeans.png" /></p>
<p><a name="FuzzyK-Means-References "></a></p>
-<h4 id="referenceswzxhzdk15">References </h4>
+<h4 id="references">References <a class="headerlink" href="#references"
title="Permanent link">¶</a></h4>
<ul>
<li><a
href="http://en.wikipedia.org/wiki/Fuzzy_clustering">http://en.wikipedia.org/wiki/Fuzzy_clustering</a></li>
</ul>
Modified:
websites/staging/mahout/trunk/content/users/clustering/hierarchical-clustering.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/hierarchical-clustering.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/hierarchical-clustering.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>Hierarchical clustering is the process or finding bigger clusters, and
also
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>Hierarchical clustering is the process of finding bigger clusters, and also
the smaller clusters inside the bigger clusters.</p>
<p>In Apache Mahout, separate algorithms can be used for finding clusters at
different levels. </p>
Modified:
websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="k-means-clustering-basics">k-Means clustering - basics</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="k-means-clustering-basics">k-Means clustering - basics<a
class="headerlink" href="#k-means-clustering-basics" title="Permanent
link">¶</a></h1>
<p><a href="http://en.wikipedia.org/wiki/Kmeans">k-Means</a> is a simple but
well-known algorithm for grouping objects, that is, clustering. All objects need to be
represented
as a set of numerical features. In addition, the user has to specify the
number of groups (referred to as <em>k</em>) she wishes to identify.</p>
@@ -284,7 +296,7 @@ computation of new average centers have
estimation of the number of clusters <em>k</em>. Yet the main principle always
remains the same.</p>
<p><a name="K-MeansClustering-Quickstart"></a></p>
-<h2 id="quickstart">Quickstart</h2>
+<h2 id="quickstart">Quickstart<a class="headerlink" href="#quickstart"
title="Permanent link">¶</a></h2>
<p><a
href="https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh">Here</a>
is a short shell script outline that will get you started quickly with
k-means. This does the following:</p>
@@ -301,7 +313,7 @@ reuters-out from reuters-sgm (the downlo
<p>After following through the output that scrolls past, reading the code will
offer you a better understanding.</p>
<p><a name="K-MeansClustering-Designofimplementation"></a></p>
-<h2 id="implementation">Implementation</h2>
+<h2 id="implementation">Implementation<a class="headerlink"
href="#implementation" title="Permanent link">¶</a></h2>
<p>The implementation accepts two input directories: one for the data points
and one for the initial clusters. The data directory contains multiple
input files of SequenceFile(Key, VectorWritable), while the clusters
@@ -330,7 +342,7 @@ iteration and 'clusteredPoints' will con
implementation provided by Mahout:
<img src="../../images/Example implementation of k-Means provided with
Mahout.png"></p>
<p><a name="K-MeansClustering-Runningk-MeansClustering"></a></p>
-<h2 id="running-k-means-clustering">Running k-Means Clustering</h2>
+<h2 id="running-k-means-clustering">Running k-Means Clustering<a
class="headerlink" href="#running-k-means-clustering" title="Permanent
link">¶</a></h2>
<p>The k-Means clustering algorithm may be run using a command-line invocation
on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().</p>
<p>Invocation using the command line takes the form:</p>
@@ -386,7 +398,7 @@ clustering, the weights are computed as
is between the cluster center and the vector using the chosen
DistanceMeasure.</p>
<p><a name="K-MeansClustering-Examples"></a></p>
-<h1 id="examples">Examples</h1>
+<h1 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h1>
<p>The following images illustrate k-Means clustering applied to a set of
randomly-generated 2-d data points. The points are generated using a normal
distribution centered at a mean location and with a constant standard
Modified:
websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,12 +264,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="k-means-commandline-Introduction"></a></p>
-<h1 id="kmeans-commandline-introduction">kMeans commandline introduction</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="k-means-commandline-Introduction"></a></p>
+<h1 id="kmeans-commandline-introduction">kMeans commandline introduction<a
class="headerlink" href="#kmeans-commandline-introduction" title="Permanent
link">¶</a></h1>
<p>This quick start page describes how to run the kMeans clustering algorithm
on a Hadoop cluster. </p>
<p><a name="k-means-commandline-Steps"></a></p>
-<h1 id="steps">Steps</h1>
+<h1 id="steps">Steps<a class="headerlink" href="#steps" title="Permanent
link">¶</a></h1>
<p>Mahout's k-Means clustering can be launched from the same command line
invocation whether you are running on a single machine in stand-alone mode
or on a larger Hadoop cluster. The difference is determined by the
@@ -285,7 +297,7 @@ will be generated in $MAHOUT_HOME/core/t
the Mahout version number. For example, when using Mahout 0.3 release, the
job will be mahout-core-0.3.job</p>
<p><a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a></p>
-<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster</h2>
+<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster<a class="headerlink"
href="#testing-it-on-one-single-machine-wo-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>Put the data: cp &lt;PATH TO DATA&gt; testdata</li>
<li>
@@ -296,7 +308,7 @@ org.apache.mahout.common.distance.Cosine
</li>
</ul>
<p><a name="k-means-commandline-Runningitonthecluster"></a></p>
-<h2 id="running-it-on-the-cluster">Running it on the cluster</h2>
+<h2 id="running-it-on-the-cluster">Running it on the cluster<a
class="headerlink" href="#running-it-on-the-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li>
<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put &lt;PATH TO DATA&gt; testdata</li>
@@ -312,7 +324,7 @@ to view all outputs.</p>
</li>
</ul>
<p><a name="k-means-commandline-Commandlineoptions"></a></p>
-<h1 id="command-line-options">Command line options</h1>
+<h1 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h1>
<div class="codehilite"><pre> <span class="o">--</span><span
class="n">input</span> <span class="p">(</span><span class="o">-</span><span
class="nb">i</span><span class="p">)</span> <span class="n">input</span>
<span class="n">Path</span> <span class="n">to</span> <span
class="n">job</span> <span class="n">input</span> <span
class="n">directory</span><span class="p">.</span>
<span class="n">Must</span> <span
class="n">be</span> <span class="n">a</span> <span
class="n">SequenceFile</span> <span class="n">of</span>
<span class="n">VectorWritable</span>
Modified:
websites/staging/mahout/trunk/content/users/clustering/latent-dirichlet-allocation.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/latent-dirichlet-allocation.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/latent-dirichlet-allocation.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="LatentDirichletAllocation-Overview"></a></p>
-<h1 id="overview">Overview</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="LatentDirichletAllocation-Overview"></a></p>
+<h1 id="overview">Overview<a class="headerlink" href="#overview"
title="Permanent link">¶</a></h1>
<p>Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
algorithm for automatically and jointly clustering words into "topics" and
documents into mixtures of topics. It has been successfully applied to
@@ -295,7 +307,7 @@ to have come from one of the models in t
which. The way we deal with that is to use a so-called latent parameter
which specifies which model each data point came from.</p>
<p><a name="LatentDirichletAllocation-CollapsedVariationalBayes"></a></p>
-<h1 id="collapsed-variational-bayes">Collapsed Variational Bayes</h1>
+<h1 id="collapsed-variational-bayes">Collapsed Variational Bayes<a
class="headerlink" href="#collapsed-variational-bayes" title="Permanent
link">¶</a></h1>
<p>The CVB algorithm which is implemented in Mahout for LDA combines
advantages of both regular Variational Bayes and Gibbs Sampling. The
algorithm relies on modeling dependence of parameters on latent variables
@@ -315,7 +327,7 @@ the order of O(K) with each update to q(
document/word pair only 1 copy of the variational posterior is required
over the latent variable.</p>
<p><a name="LatentDirichletAllocation-InvocationandUsage"></a></p>
-<h1 id="invocation-and-usage">Invocation and Usage</h1>
+<h1 id="invocation-and-usage">Invocation and Usage<a class="headerlink"
href="#invocation-and-usage" title="Permanent link">¶</a></h1>
<p>Mahout's implementation of LDA operates on a collection of SparseVectors of
word counts. These word counts should be non-negative integers, though
things will probably work fine if you use non-negative reals. (Note
@@ -360,7 +372,7 @@ LDAPrintTopics utility:</p>
<p><a name="LatentDirichletAllocation-Example"></a></p>
-<h1 id="example">Example</h1>
+<h1 id="example">Example<a class="headerlink" href="#example" title="Permanent
link">¶</a></h1>
<p>An example is located in mahout/examples/bin/build-reuters.sh. The script
automatically downloads the Reuters-21578 corpus, builds a Lucene index and
converts the Lucene index to vectors. By uncommenting the last two lines
@@ -370,7 +382,7 @@ resultant topics to the console. </p>
support for Reuters, and that building your own index will require some
adaptation. The rest should hopefully not differ too much.</p>
<p><a name="LatentDirichletAllocation-ParameterEstimation"></a></p>
-<h1 id="parameter-estimation">Parameter Estimation</h1>
+<h1 id="parameter-estimation">Parameter Estimation<a class="headerlink"
href="#parameter-estimation" title="Permanent link">¶</a></h1>
<p>We use mean field variational inference to estimate the models. Variational
inference can be thought of as a generalization of <a
href="expectation-maximization.html">EM</a>
for hierarchical Bayesian models. The E-Step takes the form of, for each
@@ -383,7 +395,7 @@ distribution over the entire vocabulary
executed in the reduce step, with the final normalization happening as a
post-processing step.</p>
<p><a name="LatentDirichletAllocation-References"></a></p>
-<h1 id="references">References</h1>
+<h1 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h1>
<p><a
href="http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf">David
M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent
Dirichlet Allocation. JMLR.</a></p>
<p><a href="http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf">Thomas
L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS. </a></p>
<p><a href="http://aclweb.org/anthology//D/D08/D08-1038.pdf">David Hall, Dan
Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using
Topic Models </a></p>
Modified:
websites/staging/mahout/trunk/content/users/clustering/lda-commandline.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/lda-commandline.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/lda-commandline.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a></p>
-<h1
id="running-latent-dirichlet-allocation-algorithm-from-the-command-line">Running
Latent Dirichlet Allocation (algorithm) from the Command Line</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a
name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a></p>
+<h1
id="running-latent-dirichlet-allocation-algorithm-from-the-command-line">Running
Latent Dirichlet Allocation (algorithm) from the Command Line<a
class="headerlink"
href="#running-latent-dirichlet-allocation-algorithm-from-the-command-line"
title="Permanent link">¶</a></h1>
<p><a href="https://issues.apache.org/jira/browse/MAHOUT-897">Since Mahout
v0.6</a>
LDA has been implemented as Collapsed Variational Bayes (cvb).</p>
<p>Mahout's LDA can be launched from the same command line invocation whether
@@ -285,7 +297,7 @@ the Mahout version number. For example,
job will be mahout-core-0.3.job</li>
</ul>
<p><a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a></p>
-<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster</h2>
+<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster<a class="headerlink"
href="#testing-it-on-one-single-machine-wo-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>Put the data: cp <PATH TO DATA> testdata</li>
<li>
@@ -294,7 +306,7 @@ job will be mahout-core-0.3.job</li>
</li>
</ul>
<p><a name="lda-commandline-Runningitonthecluster"></a></p>
-<h2 id="running-it-on-the-cluster">Running it on the cluster</h2>
+<h2 id="running-it-on-the-cluster">Running it on the cluster<a
class="headerlink" href="#running-it-on-the-cluster" title="Permanent
link">¶</a></h2>
<ul>
<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li>
<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata</li>
@@ -310,7 +322,7 @@ to view all outputs.</p>
</li>
</ul>
<p><a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a></p>
-<h1 id="command-line-options-from-mahout-cvb-version-08">Command line options
from Mahout cvb version 0.8</h1>
+<h1 id="command-line-options-from-mahout-cvb-version-08">Command line options
from Mahout cvb version 0.8<a class="headerlink"
href="#command-line-options-from-mahout-cvb-version-08" title="Permanent
link">¶</a></h1>
<div class="codehilite"><pre><span class="n">mahout</span> <span
class="n">cvb</span> <span class="o">-</span><span class="n">h</span>
<span class="o">--</span><span class="n">input</span> <span
class="p">(</span><span class="o">-</span><span class="nb">i</span><span
class="p">)</span> <span class="n">input</span> <span
class="n">Path</span> <span class="n">to</span> <span class="n">job</span>
<span class="n">input</span> <span class="n">directory</span><span
class="p">.</span>
<span class="o">--</span><span class="n">output</span> <span
class="p">(</span><span class="o">-</span><span class="n">o</span><span
class="p">)</span> <span class="n">output</span> <span
class="n">The</span> <span class="n">directory</span> <span
class="n">pathname</span> <span class="k">for</span> <span
class="n">output</span><span class="p">.</span>
Modified:
websites/staging/mahout/trunk/content/users/clustering/llr---log-likelihood-ratio.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/llr---log-likelihood-ratio.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/llr---log-likelihood-ratio.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="likelihood-ratio-test">Likelihood ratio test</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="likelihood-ratio-test">Likelihood ratio test<a class="headerlink"
href="#likelihood-ratio-test" title="Permanent link">¶</a></h1>
<p><em>Likelihood ratio test is used to compare the fit of two models one
of which is nested within the other.</em></p>
<p>In the context of machine learning and the Mahout project in particular,
Modified:
websites/staging/mahout/trunk/content/users/clustering/spectral-clustering.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/clustering/spectral-clustering.html
(original)
+++
websites/staging/mahout/trunk/content/users/clustering/spectral-clustering.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="spectral-clustering-overview">Spectral Clustering Overview</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="spectral-clustering-overview">Spectral Clustering Overview<a
class="headerlink" href="#spectral-clustering-overview" title="Permanent
link">¶</a></h1>
<p>Spectral clustering, as its name implies, makes use of the spectrum (or
eigenvalues) of the similarity matrix of the data. It examines the
<em>connectedness</em> of the data, whereas other clustering algorithms such as
k-means use the <em>compactness</em> to assign clusters. Consequently, in
situations where k-means performs well, spectral clustering will also perform
well. Additionally, there are situations in which k-means will underperform
(e.g. concentric circles), but spectral clustering will be able to segment the
underlying clusters. Spectral clustering is also very useful for image
segmentation.</p>
<p>At its simplest, spectral clustering relies on the following four steps:</p>
<ol>
@@ -281,16 +293,16 @@
</li>
</ol>
<p>For more theoretical background on spectral clustering, such as how
affinity matrices are computed, the different types of graph Laplacians, and
whether the top or bottom eigenvectors and eigenvalues are computed, please
read <a
href="http://link.springer.com/article/10.1007/s11222-007-9033-z">Ulrike von
Luxburg's article in <em>Statistics and Computing</em> from December 2007</a>.
It provides an excellent description of the linear algebra operations behind
spectral clustering, and gives a thorough understanding of the
situations in which it can be used.</p>
-<h1 id="mahout-spectral-clustering">Mahout Spectral Clustering</h1>
+<h1 id="mahout-spectral-clustering">Mahout Spectral Clustering<a
class="headerlink" href="#mahout-spectral-clustering" title="Permanent
link">¶</a></h1>
<p>As of Mahout 0.3, spectral clustering has been implemented to take
advantage of the MapReduce framework. It uses <a
href="http://mahout.apache.org/users/dim-reduction/ssvd.html">SSVD</a> for
dimensionality reduction of the input data set, and <a
href="http://mahout.apache.org/users/clustering/k-means-clustering.html">k-means</a>
to perform the final clustering.</p>
<p><strong>(<a
href="https://issues.apache.org/jira/browse/MAHOUT-1538">MAHOUT-1538</a> will
port the existing Hadoop MapReduce implementation to Mahout DSL, allowing for
one of several distinct distributed back-ends to conduct the
computation)</strong></p>
-<h2 id="input">Input</h2>
+<h2 id="input">Input<a class="headerlink" href="#input" title="Permanent
link">¶</a></h2>
<p>The algorithm currently takes as input a Hadoop-backed affinity matrix
stored as text files. Each line of the text
file specifies a single element of the affinity matrix: the row index
<code>\(i\)</code>, the column index <code>\(j\)</code>, and the value:</p>
<p><code>i, j, value</code></p>
<p>The affinity matrix is symmetric, and any unspecified <code>\(i, j\)</code>
pairs are assumed to be 0 for sparsity. Thus, only the non-zero entries of
either the upper or lower triangle need be specified. The row and column
indices are 0-indexed.</p>
<p>The matrix elements specified in the text files are collected into a Mahout
<code>DistributedRowMatrix</code>.</p>
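<p>As a rough illustration of the <code>i, j, value</code> format described
above, the following Python sketch writes the strict upper triangle of an
affinity matrix to a text file. The Gaussian kernel, the <code>sigma</code>
parameter, and the function names <code>affinity_lines</code> and
<code>write_affinity</code> are assumptions made for this example; they are not
part of Mahout, and any symmetric affinity measure could be substituted.</p>

```python
import math


def affinity_lines(points, sigma=1.0):
    """Yield 'i, j, value' lines for the strict upper triangle.

    Assumption: affinities come from a Gaussian kernel on Euclidean
    distance. Diagonal entries are left unspecified, so they are
    treated as 0 by the input format.
    """
    n = len(points)
    for i in range(n):                 # 0-indexed rows
        for j in range(i + 1, n):      # upper triangle only (matrix is symmetric)
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            value = math.exp(-d2 / (2.0 * sigma ** 2))
            yield f"{i}, {j}, {value:.6f}"


def write_affinity(points, path, sigma=1.0):
    """Write one matrix entry per line to a text file (hypothetical helper)."""
    with open(path, "w") as handle:
        for line in affinity_lines(points, sigma):
            handle.write(line + "\n")
```

<p>For <code>n</code> input points this produces <code>n(n-1)/2</code> lines,
and <code>n</code> is the matrix dimension the algorithm needs to be told
about.</p>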
<p><strong>(<a
href="https://issues.apache.org/jira/browse/MAHOUT-1539">MAHOUT-1539</a> will
allow for the creation of the affinity matrix to occur as part of the core
spectral clustering algorithm, as opposed to the current requirement that the
user create this matrix themselves and provide it, rather than the original
data, to the algorithm)</strong></p>
-<h2 id="running-spectral-clustering">Running spectral clustering</h2>
+<h2 id="running-spectral-clustering">Running spectral clustering<a
class="headerlink" href="#running-spectral-clustering" title="Permanent
link">¶</a></h2>
<p><strong>(<a
href="https://issues.apache.org/jira/browse/MAHOUT-1540">MAHOUT-1540</a> will
provide a running example of this algorithm and this section will be updated to
show how to run the example and what the expected output should be; until then,
this section provides a how-to for simply running the algorithm on arbitrary
input)</strong></p>
<p>Spectral clustering can be invoked with the following arguments.</p>
<div class="codehilite"><pre><span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span
class="n">spectralkmeans</span> <span class="o">\</span>
@@ -303,7 +315,7 @@
<p>The affinity matrix can be contained in a single text file (using the
aforementioned one-line-per-entry format) or span many text files (per <a
href="https://issues.apache.org/jira/browse/MAHOUT-978">MAHOUT-978</a>, do
not prefix text files with a leading underscore '_' or period '.'). The
<code>-d</code> flag is required for the algorithm to know the dimensions of
the affinity matrix. <code>-k</code> is the number of top eigenvectors from the
normalized graph Laplacian in the SSVD step, and also the number of clusters
given to k-means after the SSVD step.</p>
-<h2 id="example">Example</h2>
+<h2 id="example">Example<a class="headerlink" href="#example" title="Permanent
link">¶</a></h2>
<p>To provide a simple example, take the following affinity matrix, contained
in a text file called <code>affinity.txt</code>:</p>
<div class="codehilite"><pre>0<span class="p">,</span> 0<span
class="p">,</span> 0
0<span class="p">,</span> 1<span class="p">,</span> 0<span class="p">.</span>8