Author: buildbot
Date: Sun Mar 29 18:54:05 2015
New Revision: 945539
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/classification/logistic-regression.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sun Mar 29 18:54:05 2015
@@ -1 +1 @@
-1669854
+1669950
Modified:
websites/staging/mahout/trunk/content/users/classification/logistic-regression.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/logistic-regression.html (original)
+++ websites/staging/mahout/trunk/content/users/classification/logistic-regression.html Sun Mar 29 18:54:05 2015
@@ -261,8 +261,10 @@ production fraud detection and advertisi
The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow
large training sets to be used.</p>
<p>For a more detailed analysis of the approach, have a look at the <a
href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en">thesis
of
-Paul Komarek</a>.</p>
+Paul Komarek</a> [1].</p>
<p>See MAHOUT-228 for the main JIRA issue for SGD.</p>
+<p>A more detailed overview of the Mahout Logistic Regression classifier and a <a
href="http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/">detailed
description of building a Logistic Regression classifier</a> for the classic
<a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">Iris flower
dataset</a> are also available [2].</p>
+<p>An example of training a Logistic Regression classifier for the <a
href="http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing">UCI Bank Marketing
Dataset</a> can be found <a
href="http://mahout.apache.org/users/classification/bankmarketing-example.html">on
the Mahout website</a> [3].</p>
<p><a name="LogisticRegression-Parallelizationstrategy"></a></p>
<h2 id="parallelization-strategy">Parallelization strategy</h2>
<p>The bad news is that SGD is an inherently sequential algorithm. The good
@@ -298,7 +300,7 @@ include</p>
</li>
</ul>
<p><a name="LogisticRegression-Featurevectorencoding"></a></p>
-<h3 id="feature-vector-encoding">Feature vector encoding</h3>
+<h2 id="feature-vector-encoding">Feature vector encoding</h2>
<p>Because the SGD algorithms need to have fixed length feature vectors and
because it is a pain to build a dictionary ahead of time, most SGD
applications use the hashed feature vector encoding system that is rooted
@@ -317,7 +319,7 @@ case you are getting your training data
<p>Here is a class diagram for the encoders package:</p>
<p><img alt="class diagram" src="../../images/vector-class-hierarchy.png"
/></p>
<p><a name="LogisticRegression-SGDLearning"></a></p>
-<h3 id="sgd-learning">SGD Learning</h3>
+<h2 id="sgd-learning">SGD Learning</h2>
<p>For the simplest applications, you can construct an
OnlineLogisticRegression and be off and running. Typically, though, it is
nice to have running estimates of performance on held out data. To do
@@ -338,6 +340,12 @@ so that you don't have to.</p>
the number of twiddlable knobs is pretty large. For some examples, see the
TrainNewsGroups example code.</p>
<p><img alt="sgd class diagram" src="../../images/sgd-class-hierarchy.png"
/></p>
+<h2 id="references">References</h2>
+<p>[1] <a
href="http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en">Thesis
of
+Paul Komarek</a></p>
+<p>[2] <a
href="http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/">An
Introduction To Mahout's Logistic Regression SGD Classifier</a></p>
+<h2 id="examples">Examples</h2>
+<p>[3] <a
href="http://mahout.apache.org/users/classification/bankmarketing-example.html">SGD
Bank Marketing Example</a></p>
</div>
</div>
</div>