Modified:
websites/staging/mahout/trunk/content/users/basics/mahoutintegration.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/mahoutintegration.html
(original)
+++ websites/staging/mahout/trunk/content/users/basics/mahoutintegration.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified:
websites/staging/mahout/trunk/content/users/basics/matrix-and-vector-needs.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/basics/matrix-and-vector-needs.html
(original)
+++
websites/staging/mahout/trunk/content/users/basics/matrix-and-vector-needs.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="MatrixandVectorNeeds-Intro"></a></p>
-<h1 id="intro">Intro</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="MatrixandVectorNeeds-Intro"></a></p>
+<h1 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h1>
<p>Most ML algorithms require the ability to represent multidimensional data
concisely and to be able to easily perform common operations on that data.
MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
@@ -276,10 +288,10 @@ applications requiring vectors or matric
JVM, though such applications might be able to utilize them within a larger
organizing framework.</p>
<p><a name="MatrixandVectorNeeds-Background"></a></p>
-<h2 id="background">Background</h2>
+<h2 id="background">Background<a class="headerlink" href="#background"
title="Permanent link">¶</a></h2>
<p>See <a
href="http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser">http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser</a></p>
<p><a name="MatrixandVectorNeeds-Vectors"></a></p>
-<h2 id="vectors">Vectors</h2>
+<h2 id="vectors">Vectors<a class="headerlink" href="#vectors" title="Permanent
link">¶</a></h2>
<p>Mahout supports a Vector interface that defines the following operations
over all implementation classes: assign, cardinality, copy, divide, dot, get,
haveSharedCells, like, minus, normalize, plus, set, size, times, toArray,
viewPart, zSum and cross. The class DenseVector implements vectors as a
double[]
that is storage and access efficient. The class SparseVector implements
vectors as a HashMap<Integer, Double> that is surprisingly fast and
@@ -289,7 +301,7 @@ dimensions it holds. An additional Vecto
underlying vector to be specified by the viewPart() method. See the
JavaDocs for more complete definitions.</p>
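<p>The two storage strategies described above can be illustrated with a minimal sketch. It is written in Python rather than Java and does not use Mahout's classes; the function names <code>dense_dot</code>, <code>sparse_dot</code> and <code>z_sum</code> are hypothetical and only mirror the dot and zSum operations listed for the Vector interface.</p>

```python
from typing import Dict, List


def dense_dot(a: List[float], b: List[float]) -> float:
    """Dot product of two dense vectors stored as plain arrays."""
    if len(a) != len(b):
        raise ValueError("cardinality mismatch")
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as index -> value maps."""
    # Iterate over the smaller map; indices missing from the other map are zero.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0.0) for index, value in a.items())


def z_sum(v: Dict[int, float]) -> float:
    """Sum of all cells; implicit zeros contribute nothing."""
    return sum(v.values())


# The same 4-element vector in both representations.
dense = [1.0, 0.0, 0.0, 2.0]
sparse = {0: 1.0, 3: 2.0}
```

<p>The sparse form only touches the stored entries, which is why a map-backed vector can stay fast when most cells are zero.</p>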
<p><a name="MatrixandVectorNeeds-Matrices"></a></p>
-<h2 id="matrices">Matrices</h2>
+<h2 id="matrices">Matrices<a class="headerlink" href="#matrices"
title="Permanent link">¶</a></h2>
<p>Mahout also supports a Matrix interface that defines a similar set of
operations over all implementation classes: assign, assignColumn, assignRow,
cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size,
times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements
matrices as a double[][]
that is storage and access efficient. The class SparseRowMatrix
implements matrices as a Vector[] holding the rows of the matrix in a
@@ -317,7 +329,7 @@ eigenvectors would also be useful. Batch
also be useful, such as perhaps assignRow or assignColumn accepting
UnaryFunction and BinaryFunction arguments.</p>
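<p>As a rough illustration of the idea above, the following Python sketch stores a matrix as a list of sparse rows and lets a row be reassigned through a unary function. The class <code>SparseRowMatrixSketch</code> and its methods are hypothetical stand-ins, not Mahout's SparseRowMatrix or its UnaryFunction type.</p>

```python
from typing import Callable, Dict, List

SparseRow = Dict[int, float]  # column index -> non-zero value


class SparseRowMatrixSketch:
    """Hypothetical row-wise sparse matrix; each row is an index -> value map."""

    def __init__(self, rows: int, cols: int) -> None:
        self.cols = cols
        self.rows: List[SparseRow] = [{} for _ in range(rows)]

    def set(self, r: int, c: int, value: float) -> None:
        # Zeros are not stored, so setting a cell to zero removes it.
        if value == 0.0:
            self.rows[r].pop(c, None)
        else:
            self.rows[r][c] = value

    def get(self, r: int, c: int) -> float:
        return self.rows[r].get(c, 0.0)

    def assign_row(self, r: int, fn: Callable[[float], float]) -> None:
        # Visit every column, including implicit zeros, because fn(0) may be non-zero.
        for c in range(self.cols):
            self.set(r, c, fn(self.get(r, c)))

    def times(self, vec: List[float]) -> List[float]:
        # Matrix-vector product that only touches stored cells.
        return [sum(v * vec[c] for c, v in row.items()) for row in self.rows]
```

<p>For example, <code>m.assign_row(0, lambda x: x + 1.0)</code> would add one to every cell of row 0 in place.</p>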
<p><a name="MatrixandVectorNeeds-Ideas"></a></p>
-<h2 id="ideas">Ideas</h2>
+<h2 id="ideas">Ideas<a class="headerlink" href="#ideas" title="Permanent
link">¶</a></h2>
<p>As Vector and Matrix implementations are currently memory-resident, very
large instances greater than available memory are not supported. An
extended set of implementations that use HBase (BigTable) in Hadoop to
@@ -326,7 +338,7 @@ large collections.<br />
See <a href="https://issues.apache.org/jira/browse/MAHOUT-6">MAHOUT-6</a>
See <a href="http://wiki.apache.org/hadoop/Hama">Hama</a></p>
<p><a name="MatrixandVectorNeeds-References"></a></p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<p>Have a look at the old parallel computing libraries like <a
href="http://www.netlib.org/scalapack/">ScaLAPACK</a>,
among others.</p>
</div>
Modified:
websites/staging/mahout/trunk/content/users/basics/principal-components-analysis.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/basics/principal-components-analysis.html
(original)
+++
websites/staging/mahout/trunk/content/users/basics/principal-components-analysis.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a></p>
-<h1 id="principal-components-analysis">Principal Components Analysis</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a></p>
+<h1 id="principal-components-analysis">Principal Components Analysis<a
class="headerlink" href="#principal-components-analysis" title="Permanent
link">¶</a></h1>
<p>PCA is used to reduce a high-dimensional data set to fewer dimensions. PCA
can be used to identify patterns in data and to express the data in a
lower-dimensional space. That way, similarities and differences can be
@@ -280,9 +292,9 @@ this limitation.</li>
<li>Large variances are assumed to have important dynamics.</li>
</ul>
<p><a name="PrincipalComponentsAnalysis-Parallelizationstrategy"></a></p>
-<h2 id="parallelization-strategy">Parallelization strategy</h2>
+<h2 id="parallelization-strategy">Parallelization strategy<a
class="headerlink" href="#parallelization-strategy" title="Permanent
link">¶</a></h2>
<p><a name="PrincipalComponentsAnalysis-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink"
href="#design-of-packages" title="Permanent link">¶</a></h2>
</div>
</div>
</div>
Modified: websites/staging/mahout/trunk/content/users/basics/quickstart.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/quickstart.html
(original)
+++ websites/staging/mahout/trunk/content/users/basics/quickstart.html Fri Apr
8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,12 +264,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="mahout-mapreduce-overview">Mahout MapReduce Overview</h1>
-<h2 id="getting-mahout">Getting Mahout</h2>
-<h4 id="download-the-latest-release">Download the latest release</h4>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="mahout-mapreduce-overview">Mahout MapReduce Overview<a
class="headerlink" href="#mahout-mapreduce-overview" title="Permanent
link">¶</a></h1>
+<h2 id="getting-mahout">Getting Mahout<a class="headerlink"
href="#getting-mahout" title="Permanent link">¶</a></h2>
+<h4 id="download-the-latest-release">Download the latest release<a
class="headerlink" href="#download-the-latest-release" title="Permanent
link">¶</a></h4>
<p>Download the latest release <a
href="http://www.apache.org/dyn/closer.cgi/mahout/">here</a>.</p>
<p>Or check out the latest code from <a
href="http://mahout.apache.org/developers/version-control.html">here</a></p>
-<h4 id="alternatively-add-mahout-0100-to-a-maven-project">Alternatively: Add
Mahout 0.10.0 to a maven project</h4>
+<h4 id="alternatively-add-mahout-0100-to-a-maven-project">Alternatively: Add
Mahout 0.10.0 to a maven project<a class="headerlink"
href="#alternatively-add-mahout-0100-to-a-maven-project" title="Permanent
link">¶</a></h4>
<p>Mahout is also available via a <a
href="http://mvnrepository.com/artifact/org.apache.mahout">maven repository</a>
under the group id <em>org.apache.mahout</em>.
If you would like to import the latest release of mahout into a java project,
add the following dependency in your <em>pom.xml</em>:</p>
<div class="codehilite"><pre><span class="nt"><dependency></span>
@@ -279,20 +291,20 @@ If you would like to import the latest r
</pre></div>
-<h2 id="features">Features</h2>
+<h2 id="features">Features<a class="headerlink" href="#features"
title="Permanent link">¶</a></h2>
<p>For a full list of Mahout's features see our <a
href="http://mahout.apache.org/users/basics/algorithms.html">Features by
Engine</a> page.</p>
-<h2 id="using-mahout">Using Mahout</h2>
+<h2 id="using-mahout">Using Mahout<a class="headerlink" href="#using-mahout"
title="Permanent link">¶</a></h2>
<p>Mahout provides a number of examples and tutorials to help users quickly
learn how to use its machine learning algorithms.</p>
-<h4 id="recommendations">Recommendations</h4>
+<h4 id="recommendations">Recommendations<a class="headerlink"
href="#recommendations" title="Permanent link">¶</a></h4>
<p>Check the <a href="/users/recommender/quickstart.html">Recommender
Quickstart</a> or the tutorial on <a
href="/users/recommender/userbased-5-minutes.html">creating a userbased
recommender in 5 minutes</a>.</p>
<p>If you are building a recommender system for the first time, please also
refer to a list of <a
href="/users/recommender/recommender-first-timer-faq.html">Dos and Don'ts</a>
that might be helpful.</p>
-<h4 id="clustering">Clustering</h4>
+<h4 id="clustering">Clustering<a class="headerlink" href="#clustering"
title="Permanent link">¶</a></h4>
<p>Check the <a
href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic
data</a> example.</p>
-<h4 id="classification">Classification</h4>
+<h4 id="classification">Classification<a class="headerlink"
href="#classification" title="Permanent link">¶</a></h4>
<p>If you are interested in how to train a <strong>Naive Bayes</strong> model,
look at the <a href="/users/classification/twenty-newsgroups.html">20
newsgroups</a> example.</p>
<p>If you plan to build a <strong>Hidden Markov Model</strong> for speech
recognition, the example <a
href="/users/classification/hidden-markov-models.html">here</a> might be
instructive. </p>
<p>Or you could build a <strong>Random Forest</strong> model by following this
<a href="/users/classification/partial-implementation.html">quick start
page</a>.</p>
-<h4 id="working-with-text">Working with Text</h4>
+<h4 id="working-with-text">Working with Text<a class="headerlink"
href="#working-with-text" title="Permanent link">¶</a></h4>
<p>If you need to convert raw text into word vectors as input to clustering or
classification algorithms, please refer to this page on <a
href="/users/basics/creating-vectors-from-text.html">how to create vectors from
text</a>.</p>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/basics/svd---singular-value-decomposition.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/basics/svd---singular-value-decomposition.html
(original)
+++
websites/staging/mahout/trunk/content/users/basics/svd---singular-value-decomposition.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>{excerpt}Singular Value Decomposition is a form of product
decomposition of
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>{excerpt}Singular Value Decomposition is a form of product decomposition of
a matrix in which a rectangular matrix A is decomposed into a product U s
V' where U and V are orthonormal and s is a diagonal matrix.{excerpt} The
values of A can be real or complex, but the real case dominates
Modified:
websites/staging/mahout/trunk/content/users/basics/system-requirements.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/system-requirements.html
(original)
+++ websites/staging/mahout/trunk/content/users/basics/system-requirements.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="system-requirements">System Requirements</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="system-requirements">System Requirements<a class="headerlink"
href="#system-requirements" title="Permanent link">¶</a></h1>
<ul>
<li>Java 1.6.x or greater.</li>
<li>Maven 3.x to build the source code.</li>
Modified:
websites/staging/mahout/trunk/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.html
(original)
+++
websites/staging/mahout/trunk/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,7 +264,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>{excerpt}Is a weight measure often used in information retrieval and
text
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>{excerpt}Is a weight measure often used in information retrieval and text
mining. This weight is a statistical measure used to evaluate how important
a word is to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the document but is
Modified:
websites/staging/mahout/trunk/content/users/classification/bankmarketing-example.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/bankmarketing-example.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/bankmarketing-example.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,17 +264,28 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="bank-marketing-example">Bank Marketing Example</h1>
-<h3 id="introduction">Introduction</h3>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="bank-marketing-example">Bank Marketing Example<a class="headerlink"
href="#bank-marketing-example" title="Permanent link">¶</a></h1>
+<h3 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h3>
<p>This page describes how to run Mahout's SGD classifier on the <a
href="http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing">UCI Bank Marketing
dataset</a>.
The goal is to predict if the client will subscribe to a term deposit offered via
a phone call. The features in the dataset consist
of information such as age, job, marital status as well as information about
the last contacts from the bank.</p>
-<h3 id="code-data">Code & Data</h3>
+<h3 id="code-data">Code & Data<a class="headerlink" href="#code-data"
title="Permanent link">¶</a></h3>
<p>The bank marketing example code lives under </p>
<p><em>mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing</em></p>
<p>The data can be found at </p>
<p><em>mahout-examples/src/main/resources/bank-full.csv</em></p>
-<h3 id="code-details">Code details</h3>
+<h3 id="code-details">Code details<a class="headerlink" href="#code-details"
title="Permanent link">¶</a></h3>
<p>This example consists of 3 classes:</p>
<ul>
<li>BankMarketingClassificationMain</li>
Modified:
websites/staging/mahout/trunk/content/users/classification/bayesian-commandline.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/bayesian-commandline.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/bayesian-commandline.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,15 +264,26 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="naive-bayes-commandline-documentation">Naive Bayes commandline
documentation</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="naive-bayes-commandline-documentation">Naive Bayes commandline
documentation<a class="headerlink"
href="#naive-bayes-commandline-documentation" title="Permanent
link">¶</a></h1>
<p><a name="bayesian-commandline-Introduction"></a></p>
-<h2 id="introduction">Introduction</h2>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<p>This quick start page describes how to run the Naive Bayes and
Complement Naive Bayes classification algorithms on a Hadoop cluster.</p>
<p><a name="bayesian-commandline-Steps"></a></p>
-<h2 id="steps">Steps</h2>
+<h2 id="steps">Steps<a class="headerlink" href="#steps" title="Permanent
link">¶</a></h2>
<p><a
name="bayesian-commandline-Testingitononesinglemachinew/ocluster"></a></p>
-<h3 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster</h3>
+<h3 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single
machine w/o cluster<a class="headerlink"
href="#testing-it-on-one-single-machine-wo-cluster" title="Permanent
link">¶</a></h3>
<p>In the examples directory type:</p>
<div class="codehilite"><pre><span class="n">mvn</span> <span
class="o">-</span><span class="n">q</span> <span class="n">exec</span><span
class="p">:</span><span class="n">java</span>
<span class="o">-</span><span class="n">Dexec</span><span
class="p">.</span><span class="n">mainClass</span><span
class="p">=</span>"<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">bayes</span><span class="p">.</span><span
class="n">mapreduce</span><span class="p">.</span><span
class="n">bayes</span><span class="p">.</span><span class="o"><</span><span
class="n">JOB</span><span class="o">></span>"
@@ -284,7 +296,7 @@ complementary naive bayesian classificat
<p><a name="bayesian-commandline-Runningitonthecluster"></a></p>
-<h3 id="running-it-on-the-cluster">Running it on the cluster</h3>
+<h3 id="running-it-on-the-cluster">Running it on the cluster<a
class="headerlink" href="#running-it-on-the-cluster" title="Permanent
link">¶</a></h3>
<ul>
<li>
<p>In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
@@ -309,7 +321,7 @@ to view all outputs.</p>
</li>
</ul>
<p><a name="bayesian-commandline-Commandlineoptions"></a></p>
-<h2 id="command-line-options">Command line options</h2>
+<h2 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h2>
<div class="codehilite"><pre><span class="n">BayesDriver</span><span
class="p">,</span> <span class="n">BayesThetaNormalizerDriver</span><span
class="p">,</span> <span class="n">CBayesNormalizedWeightDriver</span><span
class="p">,</span> <span class="n">CBayesDriver</span><span class="p">,</span>
<span class="n">CBayesThetaDriver</span><span class="p">,</span> <span
class="n">CBayesThetaNormalizerDriver</span><span class="p">,</span> <span
class="n">BayesWeightSummerDriver</span><span class="p">,</span> <span
class="n">BayesFeatureDriver</span><span class="p">,</span> <span
class="n">BayesTfIdfDriver</span> <span class="n">Usage</span><span
class="p">:</span>
<span class="p">[</span><span class="o">--</span><span
class="n">input</span> <span class="o"><</span><span
class="n">input</span><span class="o">></span> <span
class="o">--</span><span class="n">output</span> <span
class="o"><</span><span class="n">output</span><span class="o">></span>
<span class="o">--</span><span class="n">help</span><span class="p">]</span>
Modified:
websites/staging/mahout/trunk/content/users/classification/bayesian.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/bayesian.html
(original)
+++ websites/staging/mahout/trunk/content/users/classification/bayesian.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,13 +264,24 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="naive-bayes">Naive Bayes</h1>
-<h2 id="intro">Intro</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="naive-bayes">Naive Bayes<a class="headerlink" href="#naive-bayes"
title="Permanent link">¶</a></h1>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h2>
<p>Mahout currently has two Naive Bayes implementations. The first is
standard Multinomial Naive Bayes. The second is an implementation of
Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et
al. <a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>.
We refer to the former as Bayes and the latter as CBayes.</p>
<p>Where Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with algorithms of higher
complexity such as Support Vector Machines. </p>
-<h2 id="implementations">Implementations</h2>
+<h2 id="implementations">Implementations<a class="headerlink"
href="#implementations" title="Permanent link">¶</a></h2>
<p>Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and
classification can be done via a MapReduce Job or sequentially. Mahout
provides CLI drivers for preprocessing, training and testing. A Spark
implementation is currently in the works (<a
href="https://issues.apache.org/jira/browse/MAHOUT-1493">MAHOUT-1493</a>).</p>
-<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h2>
+<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm<a
class="headerlink" href="#preprocessing-and-algorithm" title="Permanent
link">¶</a></h2>
<p>As described in <a
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a> Mahout
Naive Bayes is broken down into the following steps (assignments are over all
possible index values): </p>
<ul>
<li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of
documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in
document <code>\(j\)</code>.</li>
@@ -299,7 +311,7 @@
</li>
</ul>
<p>As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs terms more heavily based on the
likelihood that they belong to class <code>\(c\)</code>, CBayes seeks to
maximize term weights on the likelihood that they do not belong to any other
class. </p>
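<p>The contrast between the two weight calculations can be sketched in a few lines of Python. This is a simplified illustration in the spirit of Rennie et al., not Mahout's implementation: it works on raw term counts, uses a smoothing parameter <code>alpha</code> chosen here for illustration, and omits the TF-IDF transformation and the length and weight normalization steps. All function names are hypothetical.</p>

```python
import math
from typing import Dict, List

Counts = Dict[str, List[float]]  # class label -> summed term counts per vocabulary index


def bayes_weights(counts: Counts, alpha: float = 1.0) -> Counts:
    """Bayes: log of the smoothed probability of each term within its own class."""
    weights = {}
    for label, row in counts.items():
        total = sum(row) + alpha * len(row)
        weights[label] = [math.log((n + alpha) / total) for n in row]
    return weights


def cbayes_weights(counts: Counts, alpha: float = 1.0) -> Counts:
    """CBayes: negated log of the smoothed term probability in all *other* classes."""
    vocab = len(next(iter(counts.values())))
    weights = {}
    for label in counts:
        # Complement counts: sum each term over every class except `label`.
        comp = [sum(counts[o][i] for o in counts if o != label) for i in range(vocab)]
        total = sum(comp) + alpha * vocab
        weights[label] = [-math.log((n + alpha) / total) for n in comp]
    return weights


def classify(weights: Counts, doc: List[float]) -> str:
    """Assign the label whose weight vector gives the highest score for the document."""
    return max(weights, key=lambda c: sum(t * w for t, w in zip(doc, weights[c])))


# Toy training counts over a three-term vocabulary (made-up numbers).
counts = {"sports": [8.0, 1.0, 1.0], "politics": [1.0, 7.0, 2.0]}
doc = [3.0, 0.0, 1.0]
```

<p>Because CBayes estimates each class from the counts of every other class, a class with few training documents still gets weights based on a large amount of data, which is one reason it tends to behave better on skewed class distributions.</p>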
-<h2 id="running-from-the-command-line">Running from the command line</h2>
+<h2 id="running-from-the-command-line">Running from the command line<a
class="headerlink" href="#running-from-the-command-line" title="Permanent
link">¶</a></h2>
<p>Mahout provides CLI drivers for all above steps. Here we will give a
simple overview of Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">example
script</a> is given for the full process from data acquisition through
classification of the classic <a
href="https://mahout.apache.org/users/classification/twenty-newsgroups.html">20
Newsgroups corpus</a>. </p>
<ul>
<li>
@@ -344,7 +356,7 @@ Classification and testing on a holdout
</li>
</ul>
-<h2 id="command-line-options">Command line options</h2>
+<h2 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h2>
<ul>
<li><strong>Preprocessing:</strong></li>
</ul>
@@ -407,12 +419,12 @@ Classification and testing on a holdout
</li>
</ul>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h2>
<p>Mahout provides an example for Naive Bayes classification:</p>
<ol>
<li><a href="twenty-newsgroups.html">Classify 20 Newsgroups</a></li>
</ol>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<p>[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
<a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Tackling the
Poor Assumptions of Naive Bayes Text Classifiers</a>. Proceedings of the
Twentieth International Conference on Machine Learning (ICML-2003).</p>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/classification/breiman-example.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/breiman-example.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/breiman-example.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="breiman-example">Breiman Example</h1>
-<h4 id="introduction">Introduction</h4>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="breiman-example">Breiman Example<a class="headerlink"
href="#breiman-example" title="Permanent link">¶</a></h1>
+<h4 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h4>
<p>This page describes how to run the Breiman example, which implements the
test procedure described in <a
href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf">Leo
Breiman's paper</a>. The basic algorithm is as follows:</p>
<ul>
<li>repeat <em>I</em> iterations</li>
@@ -281,7 +293,7 @@ results to greater values of <em>m</em><
<li>compute the mean test error for all iterations</li>
<li>compute the mean tree error for all iterations</li>
</ul>
-<h4 id="running-the-example">Running the Example</h4>
+<h4 id="running-the-example">Running the Example<a class="headerlink"
href="#running-the-example" title="Permanent link">¶</a></h4>
<p>The current implementation is compatible with the <a
href="http://archive.ics.uci.edu/ml/">UCI repository</a> file format. We'll
show how to run this example on two datasets:</p>
<p>First, we deal with <a
href="http://archive.ics.uci.edu/ml/datasets/Glass+Identification">Glass
Identification</a>: download the <a
href="http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data">dataset</a>
file called <strong>glass.data</strong> and store it on your local machine.
Next, we must generate the descriptor file <strong>glass.info</strong> for this
dataset with the following command:</p>
<div class="codehilite"><pre><span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">df</span><span class="p">.</span><span class="n">tools</span><span
class="p">.</span><span class="n">Describe</span> <span class="o">-</span><span
class="n">p</span> <span class="o">/</span><span class="n">path</span><span
class="o">/</span><span class="n">to</span><span class="o">/</span><span
class="n">glass</span><span class="p">.</span><span class="n">data</span> <span
class="o">-</span><span class="n">f</span> <span class="o">/</span><span
class="n">path</span><span class="o">/</span><span class="n">to</span><span
class="o">/</span><span class="n">glass</span><span class="p">.</span><span
class="n">info</span> <span class="o">-</span><span class=
"n">d</span> <span class="n">I</span> 9 <span class="n">N</span> <span
class="n">L</span>
Modified:
websites/staging/mahout/trunk/content/users/classification/class-discovery.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/class-discovery.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/class-discovery.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="ClassDiscovery-ClassDiscovery"></a></p>
-<h1 id="class-discovery">Class Discovery</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="ClassDiscovery-ClassDiscovery"></a></p>
+<h1 id="class-discovery">Class Discovery<a class="headerlink"
href="#class-discovery" title="Permanent link">¶</a></h1>
<p>See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf</p>
<p>CDGA uses a Genetic Algorithm to discover a classification rule for a given
dataset.
@@ -337,7 +349,7 @@ and the following parameters: threshold
<p>Please note how the rule skipped the label attribute (Eye Color), and how
the first condition is ignored because its weight is < threshold.</p>
<p><a name="ClassDiscovery-Runningtheexample:"></a></p>
-<h1 id="running-the-example">Running the example:</h1>
+<h1 id="running-the-example">Running the example:<a class="headerlink"
href="#running-the-example" title="Permanent link">¶</a></h1>
<p>NOTE: Substitute in the appropriate version for the Mahout JOB jar</p>
<ol>
<li>cd <MAHOUT_HOME>/examples</li>
Modified:
websites/staging/mahout/trunk/content/users/classification/classifyingyourdata.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/classifyingyourdata.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/classifyingyourdata.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="classifying-data-from-the-command-line">Classifying data from the
command line</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="classifying-data-from-the-command-line">Classifying data from the
command line<a class="headerlink"
href="#classifying-data-from-the-command-line" title="Permanent
link">¶</a></h1>
<p>After you've done the <a href="../basics/quickstart.html">Quickstart</a>
and are familiar with the basics of Mahout, it is time to build a
classifier from your own data. The following pieces <em>may</em> be useful
in getting started:</p>
<p><a name="ClassifyingYourData-Input"></a></p>
-<h1 id="input">Input</h1>
+<h1 id="input">Input<a class="headerlink" href="#input" title="Permanent
link">¶</a></h1>
<p>For starters, you will need your data in an appropriate Vector format: See
<a href="../basics/creating-vectors.html">Creating Vectors</a> as well as <a
href="../basics/creating-vectors-from-text.html">Creating Vectors from
Text</a>.</p>
<p><a name="ClassifyingYourData-RunningtheProcess"></a></p>
-<h1 id="running-the-process">Running the Process</h1>
+<h1 id="running-the-process">Running the Process<a class="headerlink"
href="#running-the-process" title="Permanent link">¶</a></h1>
<ul>
<li>Logistic regression <a href="logistic-regression.html">background</a></li>
<li><a href="naivebayes.html">Naive Bayes background</a> and <a
href="bayesian-commandline.html">commandline</a> options.</li>
Modified:
websites/staging/mahout/trunk/content/users/classification/hidden-markov-models.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/hidden-markov-models.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/hidden-markov-models.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,14 +264,25 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="hidden-markov-models">Hidden Markov Models</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="hidden-markov-models">Hidden Markov Models<a class="headerlink"
href="#hidden-markov-models" title="Permanent link">¶</a></h1>
<p><a name="HiddenMarkovModels-IntroductionandUsage"></a></p>
-<h2 id="introduction-and-usage">Introduction and Usage</h2>
+<h2 id="introduction-and-usage">Introduction and Usage<a class="headerlink"
href="#introduction-and-usage" title="Permanent link">¶</a></h2>
<p>Hidden Markov Models are used in multiple areas of Machine Learning, such
as speech recognition, handwritten letter recognition or natural language
processing. </p>
<p><a name="HiddenMarkovModels-FormalDefinition"></a></p>
-<h2 id="formal-definition">Formal Definition</h2>
+<h2 id="formal-definition">Formal Definition<a class="headerlink"
href="#formal-definition" title="Permanent link">¶</a></h2>
<p>A Hidden Markov Model (HMM) is a statistical model of a process consisting
of two (in our case discrete) random variables O and Y, which change their
state sequentially. The variable Y with states {y_1, ... , y_n} is called
@@ -288,7 +300,7 @@ current state of Y.</p>
containing the observation probabilities such that B[i,j]=
P(O=o_i|Y=y_j).</p>
<p><a name="HiddenMarkovModels-Problems"></a></p>
-<h2 id="problems">Problems</h2>
+<h2 id="problems">Problems<a class="headerlink" href="#problems"
title="Permanent link">¶</a></h2>
<p>Rabiner [1]
defined three main problems for HMMs:</p>
<ol>
@@ -304,7 +316,7 @@ model M*=argmax(M)P(O|M) to generate thi
can be efficiently solved using the Baum-Welch algorithm.</li>
</ol>
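<p>As an illustration of the evaluation problem, computing P(O|M) for a given model, the sketch below uses the standard forward algorithm in plain Python. It is not Mahout's code. The function name and the <code>pi</code> vector of initial state probabilities are hypothetical, and the orientation of the transition matrix, <code>A[i][j] = P(next state j | current state i)</code>, is an assumption made for this example. <code>B</code> follows the definition above, so <code>B[o][j] = P(O=o|Y=j)</code>.</p>

```python
def forward_likelihood(obs, pi, A, B):
    """Probability of an observation sequence under a discrete HMM.

    obs -- list of observation indices
    pi  -- pi[j], probability that the hidden sequence starts in state j
    A   -- A[i][j] = P(next state j | current state i)  (assumed orientation)
    B   -- B[o][j] = P(O = o | Y = j)
    """
    n = len(pi)
    # alpha[j]: probability of the observations so far, ending in state j
    alpha = [pi[j] * B[obs[0]][j] for j in range(n)]
    for o in obs[1:]:
        alpha = [
            B[o][j] * sum(alpha[i] * A[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)
```

<p>The running time is linear in the sequence length and quadratic in the number of hidden states, which is why summing over all hidden state paths explicitly is avoided.</p>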
<p><a name="HiddenMarkovModels-Example"></a></p>
-<h2 id="example">Example</h2>
+<h2 id="example">Example<a class="headerlink" href="#example" title="Permanent
link">¶</a></h2>
<p>To build a Hidden Markov Model and use it to build some predictions, try a
simple example like this:</p>
<p>Create an input file to train the model. Here we have a sequence drawn
from the set of states 0, 1, 2, and 3, separated by space characters.</p>
<div class="codehilite"><pre>$ <span class="n">echo</span> "0 1 2 2 2 1 1
0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2
3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" <span
class="o">></span> <span class="n">hmm</span><span class="o">-</span><span
class="n">input</span>
@@ -347,7 +359,7 @@ $ $<span class="n">MAHOUT_HOME</span><sp
<p><a name="HiddenMarkovModels-Resources"></a></p>
-<h2 id="resources">Resources</h2>
+<h2 id="resources">Resources<a class="headerlink" href="#resources"
title="Permanent link">¶</a></h2>
<p>[1]
Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models
and selected applications in speech recognition". Proceedings of the IEEE
Modified:
websites/staging/mahout/trunk/content/users/classification/locally-weighted-linear-regression.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/locally-weighted-linear-regression.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/locally-weighted-linear-regression.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a
name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a></p>
-<h1 id="locally-weighted-linear-regression">Locally Weighted Linear
Regression</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a
name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a></p>
+<h1 id="locally-weighted-linear-regression">Locally Weighted Linear
Regression<a class="headerlink" href="#locally-weighted-linear-regression"
title="Permanent link">¶</a></h1>
<p>Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
use the data to build a parameterized model. After training, the model is
used for predictions and the data are generally discarded. In contrast,
@@ -275,9 +287,9 @@ regression around a point of interest us
"local" to that point. Source:
http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html</p>
<p><a
name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a></p>
-<h2 id="strategy-for-parallel-regression">Strategy for parallel regression</h2>
+<h2 id="strategy-for-parallel-regression">Strategy for parallel regression<a
class="headerlink" href="#strategy-for-parallel-regression" title="Permanent
link">¶</a></h2>
<p><a name="LocallyWeightedLinearRegression-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink"
href="#design-of-packages" title="Permanent link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/classification/logistic-regression.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/logistic-regression.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/logistic-regression.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="LogisticRegression-LogisticRegression(SGD)"></a></p>
-<h1 id="logistic-regression-sgd">Logistic Regression (SGD)</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="LogisticRegression-LogisticRegression(SGD)"></a></p>
+<h1 id="logistic-regression-sgd">Logistic Regression (SGD)<a
class="headerlink" href="#logistic-regression-sgd" title="Permanent
link">¶</a></h1>
<p>Logistic regression is a model used for prediction of the probability of
occurrence of an event. It makes use of several predictor variables that
may be either numerical or categorical.</p>
@@ -279,7 +291,7 @@ Paul Komarek</a> [1].</p>
<p>An example of training a Logistic Regression classifier for the <a
href="http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing">UCI Bank Marketing
Dataset</a> can be found <a
href="http://mahout.apache.org/users/classification/bankmarketing-example.html">on
the Mahout website</a> [3].</p>
<p>An example of training and testing a Logistic Regression document
classifier for the classic <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">20
newsgroups corpus</a> [4] is also available. </p>
<p><a name="LogisticRegression-Parallelizationstrategy"></a></p>
-<h2 id="parallelization-strategy">Parallelization strategy</h2>
+<h2 id="parallelization-strategy">Parallelization strategy<a
class="headerlink" href="#parallelization-strategy" title="Permanent
link">¶</a></h2>
<p>The bad news is that SGD is an inherently sequential algorithm. The good
news is that it is blazingly fast and thus it is not a problem for Mahout's
implementation to handle training sets of tens of millions of examples.
@@ -298,7 +310,7 @@ CrossFoldLearners in separate threads, e
learning parameters. As better settings are found, these new settings are
propagated to the other learners.</p>
<p><a name="LogisticRegression-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink"
href="#design-of-packages" title="Permanent link">¶</a></h2>
<p>There are three packages that are used in Mahout's SGD system. These
include</p>
<ul>
@@ -313,7 +325,7 @@ include</p>
</li>
</ul>
<p><a name="LogisticRegression-Featurevectorencoding"></a></p>
-<h2 id="feature-vector-encoding">Feature vector encoding</h2>
+<h2 id="feature-vector-encoding">Feature vector encoding<a class="headerlink"
href="#feature-vector-encoding" title="Permanent link">¶</a></h2>
<p>Because the SGD algorithms need to have fixed length feature vectors and
because it is a pain to build a dictionary ahead of time, most SGD
applications use the hashed feature vector encoding system that is rooted
@@ -332,7 +344,7 @@ case you are getting your training data
<p>Here is a class diagram for the encoders package:</p>
<p><img alt="class diagram" src="../../images/vector-class-hierarchy.png"
/></p>
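<p>The idea behind hashed feature vector encoding can be sketched as follows. This is a minimal Python illustration, not the encoders package itself: the function name <code>encode</code>, the vector size and the use of MD5 as the hash are all assumptions chosen for the example. Each feature name is hashed to a position in a vector of fixed length, so no dictionary has to be built ahead of time.</p>

```python
import hashlib


def encode(features, size=16):
    """Map {feature name: value} to a fixed-length list of floats.

    The position of each feature is derived from a hash of its name
    (MD5 here, purely for illustration), so unseen features need no
    dictionary lookup. Distinct features may collide and add up.
    """
    vec = [0.0] * size
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vec[index] += value
    return vec
```

<p>The vector length is a trade-off: a smaller vector saves memory but makes collisions between unrelated features more likely.</p>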
<p><a name="LogisticRegression-SGDLearning"></a></p>
-<h2 id="sgd-learning">SGD Learning</h2>
+<h2 id="sgd-learning">SGD Learning<a class="headerlink" href="#sgd-learning"
title="Permanent link">¶</a></h2>
<p>For the simplest applications, you can construct an
OnlineLogisticRegression and be off and running. Typically, though, it is
nice to have running estimates of performance on held out data. To do
@@ -353,11 +365,11 @@ so that you don't have to.</p>
the number of twiddlable knobs is pretty large. For some examples, see the
<a
href="https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainNewsGroups.java">TrainNewsGroups</a>
example code.</p>
<p><img alt="sgd class diagram" src="../../images/sgd-class-hierarchy.png"
/></p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<p>[1] <a
href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en">Thesis
of
Paul Komarek</a></p>
<p>[2] <a
href="http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/">An
Introduction To Mahout's Logistic Regression SGD Classifier</a></p>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h2>
<p>[3] <a
href="http://mahout.apache.org/users/classification/bankmarketing-example.html">SGD
Bank Marketing Example</a></p>
<p>[4] <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">SGD
20 newsgroups classification</a></p>
</div>
Modified: websites/staging/mahout/trunk/content/users/classification/mlp.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/mlp.html
(original)
+++ websites/staging/mahout/trunk/content/users/classification/mlp.html Fri Apr
8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,27 +264,38 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="multilayer-perceptron">Multilayer Perceptron</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="multilayer-perceptron">Multilayer Perceptron<a class="headerlink"
href="#multilayer-perceptron" title="Permanent link">¶</a></h1>
<p>A multilayer perceptron is a biologically inspired feed-forward network
that can
be trained to represent a nonlinear mapping between input and output data. It
consists of multiple layers, each containing multiple artificial neuron units
and
can be used for classification and regression tasks in a supervised learning
approach. </p>
-<h2 id="command-line-usage">Command line usage</h2>
+<h2 id="command-line-usage">Command line usage<a class="headerlink"
href="#command-line-usage" title="Permanent link">¶</a></h2>
<p>The MLP implementation is currently located in the MapReduce-Legacy
package. It
can be used with the following commands: </p>
-<h1 id="model-training">model training</h1>
+<h1 id="model-training">model training<a class="headerlink"
href="#model-training" title="Permanent link">¶</a></h1>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">mlp</span><span class="p">.</span><span
class="n">TrainMultilayerPerceptron</span>
</pre></div>
-<h1 id="model-usage">model usage</h1>
+<h1 id="model-usage">model usage<a class="headerlink" href="#model-usage"
title="Permanent link">¶</a></h1>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">mlp</span><span class="p">.</span><span
class="n">RunMultilayerPerceptron</span>
</pre></div>
<p>To train and use the model, a number of parameters can be specified.
Parameters without default values have to be specified by the user. Consider
that not all parameters can be used both for training and running the model. We
give an example of the usage below.</p>
-<h3 id="parameters">Parameters</h3>
-<table>
+<h3 id="parameters">Parameters<a class="headerlink" href="#parameters"
title="Permanent link">¶</a></h3>
+<table class="table">
<thead>
<tr>
<th align="left">Command</th>
@@ -373,10 +385,10 @@ can be used with the following commands:
</tr>
</tbody>
</table>
-<h2 id="example-usage">Example usage</h2>
+<h2 id="example-usage">Example usage<a class="headerlink"
href="#example-usage" title="Permanent link">¶</a></h2>
<p>In this example, we will train a multilayer perceptron for classification
on the iris data set. The iris flower data set contains data of three flower
species where each datapoint consists of four features.
The dimensions of the data set are given through some flower parameters (sepal
length, sepal width, ...). All samples contain a label that indicates the
flower species they belong to.</p>
-<h3 id="training">Training</h3>
+<h3 id="training">Training<a class="headerlink" href="#training"
title="Permanent link">¶</a></h3>
<p>To train our multilayer perceptron model from the command line, we call the
following command</p>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">mlp</span><span class="p">.</span><span
class="n">TrainMultilayerPerceptron</span> <span class="o">\</span>
<span class="o">-</span><span class="nb">i</span> <span
class="o">./</span><span class="n">mrlegacy</span><span class="o">/</span><span
class="n">src</span><span class="o">/</span><span class="n">test</span><span
class="o">/</span><span class="n">resources</span><span class="o">/</span><span
class="n">iris</span><span class="p">.</span><span class="n">csv</span> <span
class="o">-</span><span class="n">sh</span> <span class="o">\</span>
@@ -396,7 +408,7 @@ The dimensions of the data set are given
<li><code>-m 0.35</code> momentum weight is set to <code>0.35</code></li>
<li><code>-r 0.0001</code> regularization weight is set to
<code>0.0001</code></li>
</ul>
-<table>
+<table class="table">
<thead>
<tr>
<th></th>
@@ -410,7 +422,7 @@ The dimensions of the data set are given
</tr>
</tbody>
</table>
-<h3 id="testing">Testing</h3>
+<h3 id="testing">Testing<a class="headerlink" href="#testing" title="Permanent
link">¶</a></h3>
<p>To test / run the multilayer perceptron classification on the trained
model, we can use the following command</p>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">mlp</span><span class="p">.</span><span
class="n">RunMultilayerPerceptron</span> <span class="o">\</span>
<span class="o">-</span><span class="nb">i</span> <span
class="o">./</span><span class="n">mrlegacy</span><span class="o">/</span><span
class="n">src</span><span class="o">/</span><span class="n">test</span><span
class="o">/</span><span class="n">resources</span><span class="o">/</span><span
class="n">iris</span><span class="p">.</span><span class="n">csv</span> <span
class="o">-</span><span class="n">sh</span> <span class="o">-</span><span
class="n">cr</span> 0 3 <span class="o">\</span>
@@ -426,7 +438,7 @@ The dimensions of the data set are given
<li><code>-mo /tmp/model.model</code> specify where the model file is
stored</li>
<li><code>-o /tmp/labelResult.txt</code> specify where the labeled output file
will be stored</li>
</ul>
-<h2 id="implementation">Implementation</h2>
+<h2 id="implementation">Implementation<a class="headerlink"
href="#implementation" title="Permanent link">¶</a></h2>
<p>The Multilayer Perceptron implementation is based on a more general Neural
Network class. Command line support was added later and provides simple
usage of the MLP, as shown in the example. It is implemented to run on a single
machine using stochastic gradient descent, where the weights are updated using
one datapoint at a time, resulting in a weight update of the form:
$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - \eta \nabla E_n(\vec{w}^{(t)}) $$</p>
<p>where <em>η</em> is the learning rate and <em>E<sub>n</sub></em> is the
error on the <em>n</em>-th datapoint; <em>a</em> denotes the activation of a
unit. It is not yet possible to change the learning to more advanced methods
using adaptive learning rates.
</p>
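<p>The single-datapoint update described above can be sketched in a few lines
of Python. This is only an illustration, not Mahout's code: the single sigmoid
unit, the squared-error function, the learning rate name <code>eta</code> and
the helper names <code>sgd_step</code> and <code>error</code> are all
assumptions made for the example.</p>

```python
import math


def sigmoid(a):
    """Logistic sigmoid of the activation a."""
    return 1.0 / (1.0 + math.exp(-a))


def error(w, x, target):
    """Squared error E_n = 0.5 * (y - target)^2 for one datapoint (assumed loss)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (sigmoid(a) - target) ** 2


def sgd_step(w, x, target, eta):
    """One update w <- w - eta * grad E_n(w) using a single datapoint."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(a)
    # Chain rule: dE_n/dw_i = (y - target) * y * (1 - y) * x_i
    delta = (y - target) * y * (1.0 - y)
    return [wi - eta * delta * xi for wi, xi in zip(w, x)]


# Hypothetical datapoint and starting weights.
x = [1.0, 2.0]
w = [0.0, 0.0]
before = error(w, x, 1.0)
w = sgd_step(w, x, 1.0, eta=0.5)
after = error(w, x, 1.0)
print(before, after)  # the error on this datapoint shrinks after the step
```

<p>A full multilayer network repeats the same step for every weight, with the
gradient obtained by backpropagation instead of the one-line chain rule used
here.</p>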
@@ -435,7 +447,7 @@ Currently, the logistic sigmoid is used
<p>$$ \frac{1}{1 + exp(-a)} $$</p>
<p>The command line version <strong>does not perform iterations</strong>, which
leads to poor results on small datasets. Another restriction is that the CLI
version of the MLP only supports classification, since the labels have to be
given explicitly when executing on the command line. </p>
<p>A learned model can be stored and updated with new training instances using
the <code>--update</code> flag. Output of classification results is saved as a
.txt file and consists only of the assigned labels. Apart from the command-line
interface, it is possible to construct and compile more specialized neural
networks using the API and interfaces in the mrlegacy package. </p>
-<h2 id="theoretical-background">Theoretical Background</h2>
+<h2 id="theoretical-background">Theoretical Background<a class="headerlink"
href="#theoretical-background" title="Permanent link">¶</a></h2>
<p>The <em>multilayer perceptron</em> was inspired by the biological structure
of the brain, where multiple neurons are connected and form columns and layers.
Perceptual input enters this network through our sensory organs and is then
further processed into higher levels.
The term multilayer perceptron is a little misleading since the
<em>perceptron</em> is a special case of a single <em>artificial neuron</em>
that can be used for simple computations <a
href="http://en.wikipedia.org/wiki/Perceptron" title="The perceptron in
wikipedia">[1]</a>. The difference is that the perceptron uses a discontinuous
nonlinearity, while for the MLP neurons that are implemented in Mahout it is
important to use continuous nonlinearities. This is necessary for the
implemented learning algorithm, where the error is propagated back from the
output layer to the input layer and the weights of the connections are changed
according to their contribution to the overall error. This algorithm is called
backpropagation and uses gradient descent to update the weights. To compute the
gradients we need continuous nonlinearities. But let's start from the
beginning!</p>
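<p>A small sketch of why gradient descent needs a continuous nonlinearity: the
logistic sigmoid has a usable slope everywhere, while a step function of the
kind used by the classic perceptron has slope zero wherever it is
differentiable, so it gives no direction in which to adjust the weights. The
function names and the sample point are assumptions chosen for this
illustration, not part of Mahout.</p>

```python
import math


def sigmoid(a):
    """Continuous nonlinearity used by the MLP units."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Analytic derivative of the sigmoid: s(a) * (1 - s(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def step(a):
    """Discontinuous threshold nonlinearity of the classic perceptron."""
    return 1.0 if a >= 0.0 else 0.0


def finite_difference(f, a, h=1e-6):
    """Central-difference estimate of the slope of f at a."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


a = 0.7  # hypothetical activation value
analytic = sigmoid_grad(a)
numeric = finite_difference(sigmoid, a)
step_slope = finite_difference(step, a)

print(analytic, numeric)  # the two sigmoid slopes agree closely
print(step_slope)         # the step function is flat away from the threshold
```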
<p>The first layer of the MLP represents the input and has no other purpose
than routing the input to every connected unit in a feed-forward fashion.
The following layers are called hidden layers, and the last layer serves the
special purpose of determining the output. The activation of a unit <em>u</em> in a
hidden layer is computed through a weighted sum of all inputs, resulting in
Modified:
websites/staging/mahout/trunk/content/users/classification/naivebayes.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/naivebayes.html
(original)
+++ websites/staging/mahout/trunk/content/users/classification/naivebayes.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,8 +264,19 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="NaiveBayes-NaiveBayes"></a></p>
-<h1 id="naive-bayes">Naive Bayes</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="NaiveBayes-NaiveBayes"></a></p>
+<h1 id="naive-bayes">Naive Bayes<a class="headerlink" href="#naive-bayes"
title="Permanent link">¶</a></h1>
<p>Naive Bayes is an algorithm that can be used to classify objects into
usually binary categories. It is one of the most common learning algorithms
in spam filters. Despite its simplicity and rather naive assumptions it has
@@ -285,11 +297,11 @@ features of an objects are considered in
given the phrase "Statue of Liberty" was already found in a text, does not
influence the probability of seeing the phrase "New York" as well.</p>
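<p>The independence assumption is what makes scoring cheap: the score of a
class is its prior multiplied by one probability per observed feature. The
sketch below illustrates this with made-up numbers; the class names, the
probabilities and the helper <code>log_score</code> are hypothetical and are
not taken from Mahout's implementation.</p>

```python
import math


def log_score(prior, feature_probs):
    """log P(class) + sum of log P(feature | class), assuming independent features."""
    return math.log(prior) + sum(math.log(p) for p in feature_probs)


# Hypothetical probabilities of seeing the phrases "Statue of Liberty" and
# "New York" in a text, given each class.
travel = log_score(0.4, [0.8, 0.6])
other = log_score(0.6, [0.1, 0.3])

label = "travel" if travel > other else "other"
print(label)
```

<p>Working in log space avoids numerical underflow when many small
probabilities are multiplied together.</p>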
<p><a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a></p>
-<h2 id="strategy-for-a-parallel-naive-bayes">Strategy for a parallel Naive
Bayes</h2>
+<h2 id="strategy-for-a-parallel-naive-bayes">Strategy for a parallel Naive
Bayes<a class="headerlink" href="#strategy-for-a-parallel-naive-bayes"
title="Permanent link">¶</a></h2>
<p>See <a
href="https://issues.apache.org/jira/browse/MAHOUT-9">https://issues.apache.org/jira/browse/MAHOUT-9</a>
.</p>
<p><a name="NaiveBayes-Examples"></a></p>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h2>
<p><a href="20newsgroups.html">20Newsgroups</a>
- Example code showing how to train and use the Naive Bayes classifier
using the 20 Newsgroups data available at
<a href="http://people.csail.mit.edu/jrennie/20Newsgroups/">http://people.csail.mit.edu/jrennie/20Newsgroups/</a></p>
Modified:
websites/staging/mahout/trunk/content/users/classification/neural-network.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/neural-network.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/neural-network.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,17 +264,28 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p><a name="NeuralNetwork-NeuralNetworks"></a></p>
-<h1 id="neural-networks">Neural Networks</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="NeuralNetwork-NeuralNetworks"></a></p>
+<h1 id="neural-networks">Neural Networks<a class="headerlink"
href="#neural-networks" title="Permanent link">¶</a></h1>
<p>Neural Networks are a means of classifying multidimensional objects. We
concentrate on implementing backpropagation networks with one hidden layer,
as these networks have been covered by the <a
href="http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf">2006
NIPS map reduce paper</a>.
Those networks are capable of learning not only linear separating
hyperplanes but arbitrary decision boundaries.</p>
<p><a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a></p>
-<h2 id="strategy-for-parallel-backpropagation-network">Strategy for parallel
backpropagation network</h2>
+<h2 id="strategy-for-parallel-backpropagation-network">Strategy for parallel
backpropagation network<a class="headerlink"
href="#strategy-for-parallel-backpropagation-network" title="Permanent
link">¶</a></h2>
<p><a name="NeuralNetwork-Designofimplementation"></a></p>
-<h2 id="design-of-implementation">Design of implementation</h2>
+<h2 id="design-of-implementation">Design of implementation<a
class="headerlink" href="#design-of-implementation" title="Permanent
link">¶</a></h2>
</div>
</div>
</div>
Modified:
websites/staging/mahout/trunk/content/users/classification/partial-implementation.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/partial-implementation.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/partial-implementation.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,9 +264,20 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="classifying-with-random-forests">Classifying with random
forests</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="classifying-with-random-forests">Classifying with random forests<a
class="headerlink" href="#classifying-with-random-forests" title="Permanent
link">¶</a></h1>
<p><a name="PartialImplementation-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h1>
<p>This quick start page shows how to build a decision forest using the
partial implementation. This tutorial also explains how to use the decision
forest to classify new data.
@@ -274,9 +286,9 @@ builds a subset of the forest using only
partition. This allows building forests using large datasets as long as
each partition can be loaded in-memory.</p>
<p><a name="PartialImplementation-Steps"></a></p>
-<h1 id="steps">Steps</h1>
+<h1 id="steps">Steps<a class="headerlink" href="#steps" title="Permanent
link">¶</a></h1>
<p><a name="PartialImplementation-Downloadthedata"></a></p>
-<h2 id="download-the-data">Download the data</h2>
+<h2 id="download-the-data">Download the data<a class="headerlink"
href="#download-the-data" title="Permanent link">¶</a></h2>
<ul>
<li>The current implementation is compatible with the UCI repository file
format. In this example we'll use the NSL-KDD dataset because it's large
@@ -294,12 +306,12 @@ $HADOOP_HOME/bin/hadoop fs -mkdir testda
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code}</li>
</ul>
<p><a name="PartialImplementation-BuildtheJobfiles"></a></p>
-<h2 id="build-the-job-files">Build the Job files</h2>
+<h2 id="build-the-job-files">Build the Job files<a class="headerlink"
href="#build-the-job-files" title="Permanent link">¶</a></h2>
<ul>
<li>In $MAHOUT_HOME/ run: <code>mvn clean install -DskipTests</code></li>
</ul>
<p><a
name="PartialImplementation-Generateafiledescriptorforthedataset:"></a></p>
-<h2 id="generate-a-file-descriptor-for-the-dataset">Generate a file descriptor
for the dataset:</h2>
+<h2 id="generate-a-file-descriptor-for-the-dataset">Generate a file descriptor
for the dataset:<a class="headerlink"
href="#generate-a-file-descriptor-for-the-dataset" title="Permanent
link">¶</a></h2>
<p>Run the following command:</p>
<div class="codehilite"><pre>$<span class="n">HADOOP_HOME</span><span
class="o">/</span><span class="n">bin</span><span class="o">/</span><span
class="n">hadoop</span> <span class="n">jar</span>
</pre></div>
@@ -313,7 +325,7 @@ of the data. In this cases, it means 1 n
3 Categorical(C) attributes, ...L indicates the label. You can also use 'I'
to ignore some attributes</p>
<p><a name="PartialImplementation-Runtheexample"></a></p>
-<h2 id="run-the-example">Run the example</h2>
+<h2 id="run-the-example">Run the example<a class="headerlink"
href="#run-the-example" title="Permanent link">¶</a></h2>
<div class="codehilite"><pre>$<span class="n">HADOOP_HOME</span><span
class="o">/</span><span class="n">bin</span><span class="o">/</span><span
class="n">hadoop</span> <span class="n">jar</span>
</pre></div>
@@ -342,7 +354,7 @@ number of partitions.
10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in:
nsl-forest/forest.seq</p>
<p><a
name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a></p>
-<h2 id="using-the-decision-forest-to-classify-new-data">Using the Decision
Forest to Classify new data</h2>
+<h2 id="using-the-decision-forest-to-classify-new-data">Using the Decision
Forest to Classify new data<a class="headerlink"
href="#using-the-decision-forest-to-classify-new-data" title="Permanent
link">¶</a></h2>
<p>Run the following command:</p>
<div class="codehilite"><pre>$<span class="n">HADOOP_HOME</span><span
class="o">/</span><span class="n">bin</span><span class="o">/</span><span
class="n">hadoop</span> <span class="n">jar</span>
</pre></div>
@@ -387,7 +399,7 @@ if a directory containing for example tw
the output will be a directory 'predictions' containing two files
'a.data.out' and 'b.data.out'</p>
<p><a name="PartialImplementation-KnownIssuesandlimitations"></a></p>
-<h2 id="known-issues-and-limitations">Known Issues and limitations</h2>
+<h2 id="known-issues-and-limitations">Known Issues and limitations<a
class="headerlink" href="#known-issues-and-limitations" title="Permanent
link">¶</a></h2>
<p>The "Decision Forest" code is still a work in progress; many features are
still missing. Here is a list of some known issues:
<em> For now, the training does not support multiple input files. The input
Modified:
websites/staging/mahout/trunk/content/users/classification/random-forests.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/random-forests.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/random-forests.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
Modified:
websites/staging/mahout/trunk/content/users/classification/restricted-boltzmann-machines.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/classification/restricted-boltzmann-machines.html
(original)
+++
websites/staging/mahout/trunk/content/users/classification/restricted-boltzmann-machines.html
Fri Apr 8 18:41:08 2016
@@ -146,6 +146,7 @@
<li class="nav-header">Engines</li>
<li><a href="/users/sparkbindings/home.html">Spark</a></li>
<li><a
href="/users/environment/h2o-internals.html">H2O</a></li>
+ <li><a href="/users/flinkbindings/home.html">Flink</a></li>
<li class="nav-header">References</li>
<li><a
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL
Reference</a></li>
<li><a
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL
Reference</a></li>
@@ -263,13 +264,24 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <ol>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<ol>
<li></li>
</ol>
<p>The JIRA issue is <a
href="https://issues.apache.org/jira/browse/MAHOUT-375">here</a>
. </p>
<p><a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a></p>
-<h3 id="boltzmann-machines">Boltzmann Machines</h3>
+<h3 id="boltzmann-machines">Boltzmann Machines<a class="headerlink"
href="#boltzmann-machines" title="Permanent link">¶</a></h3>
<p>Boltzmann Machines are a type of stochastic neural network that closely
resembles physical processes. They define a network of units with an overall
energy that is evolved over a period of time, until it reaches thermal
@@ -277,7 +289,7 @@ equilibrium. </p>
<p>However, the convergence speed of Boltzmann machines that have
unconstrained connectivity is low.</p>
<p><a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a></p>
-<h3 id="restricted-boltzmann-machines">Restricted Boltzmann Machines</h3>
+<h3 id="restricted-boltzmann-machines">Restricted Boltzmann Machines<a
class="headerlink" href="#restricted-boltzmann-machines" title="Permanent
link">¶</a></h3>
<p>Restricted Boltzmann Machines are a variant that is 'restricted' in the
sense that connections between hidden units of a single layer are <em>not</em>
allowed. In addition, stacking multiple RBM's is also feasible, with the
@@ -287,7 +299,7 @@ parallelization. </p>
<p>In the Netflix Prize, RBM's offered distinctly orthogonal predictions to
SVD and k-NN approaches, and contributed immensely to the final solution.</p>
<p><a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a></p>
-<h3 id="rbms-in-apache-mahout">RBM's in Apache Mahout</h3>
+<h3 id="rbms-in-apache-mahout">RBM's in Apache Mahout<a class="headerlink"
href="#rbms-in-apache-mahout" title="Permanent link">¶</a></h3>
<p>An implementation of Restricted Boltzmann Machines is being developed for
Apache Mahout as a Google Summer of Code 2010 project. A recommender
interface will also be provided. The key aims of the implementation are: