Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Logistic Regression 
(https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression)


Edited by Federico Castanedo:
---------------------------------------------------------------------
h1. Logistic Regression

Logistic regression is a model used to predict the probability that an event 
occurs. It makes use of several predictor variables that may be either 
numerical or categorical.

Logistic regression is the standard industry workhorse that underlies many 
production fraud detection and advertising quality and targeting products.  The 
Mahout implementation uses Stochastic Gradient Descent (SGD) to allow large 
training sets to be used.
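
As a rough illustration of why SGD copes with large training sets, here is a minimal plain-Java sketch of logistic regression trained one example at a time. It is not Mahout's implementation or API: the class SgdSketch, its method names, the synthetic data and the learning rate of 0.1 are all assumptions made for this example, and regularization and annealing are left out.

```java
import java.util.Random;

/** Plain-Java sketch of logistic regression trained with SGD (not Mahout's API). */
class SgdSketch {
    /** Logistic function: maps a raw score to a probability in (0, 1). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Predicted probability of the positive class for feature vector x. */
    static double predict(double[] w, double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return sigmoid(z);
    }

    /** One SGD step: nudge the weights using a single example. */
    static void trainOne(double[] w, double[] x, int y, double rate) {
        double error = y - predict(w, x);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * error * x[i];
        }
    }

    /** Streams n synthetic examples (label is 1 when the feature is positive). */
    static double[] train(int n, long seed) {
        Random rnd = new Random(seed);
        double[] w = new double[2]; // w[0] is the bias weight
        for (int k = 0; k < n; k++) {
            double v = rnd.nextDouble() * 2.0 - 1.0;
            trainOne(w, new double[] {1.0, v}, v > 0 ? 1 : 0, 0.1);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] w = train(10000, 42L);
        System.out.println("P(y=1 | v=0.8)  = " + predict(w, new double[] {1.0, 0.8}));
        System.out.println("P(y=1 | v=-0.8) = " + predict(w, new double[] {1.0, -0.8}));
    }
}
```

Each call to trainOne touches only the current example, so memory use does not grow with the size of the training set.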

For a more detailed analysis of the approach, have a look at the thesis of Paul 
Komarek:

http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en


h2. Parallelization strategy

The bad news is that SGD is an inherently sequential algorithm.  The good news 
is that it is blazingly fast and thus it is not a problem for Mahout's 
implementation to handle training sets of tens of millions of examples.  With 
the down-sampling typical in many data-sets, this is equivalent to a dataset 
with billions of raw training examples.

The SGD system in Mahout is an online learning algorithm which means that you 
can learn models in an incremental fashion and that you can do performance 
testing as your system runs.  Often this means that you can stop training when 
a model reaches a target level of performance.  The SGD framework includes 
classes to do on-line evaluation using cross validation (the CrossFoldLearner) 
and an evolutionary system to do learning hyper-parameter optimization on the 
fly (the AdaptiveLogisticRegression).  The AdaptiveLogisticRegression system 
makes heavy use of threads to increase machine utilization.  The way it works 
is that it runs 20 CrossFoldLearners in separate threads, each with slightly 
different learning parameters.  As better settings are found, these new 
settings are propagated to the other learners.
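
The idea of propagating better settings through a pool can be sketched in plain Java as follows. This is a deliberately simplified stand-in, not the algorithm in org.apache.mahout.ep: the class EvolveSketch, the score function (a made-up proxy for held-out performance that peaks at a learning rate of 0.05) and the mutation width are all hypothetical, and the sketch is single-threaded.

```java
import java.util.Random;

/** Plain-Java sketch of evolving a pool of learning rates (not Mahout's ep package). */
class EvolveSketch {
    /** Hypothetical stand-in for held-out performance; highest at rate = 0.05. */
    static double score(double rate) {
        double d = Math.log10(rate) - Math.log10(0.05);
        return -d * d;
    }

    /** Random starting pool, log-uniform between 1e-4 and 1e2. */
    static double[] initialPool(int n, Random rnd) {
        double[] rates = new double[n];
        for (int i = 0; i < n; i++) {
            rates[i] = Math.pow(10.0, -4.0 + 6.0 * rnd.nextDouble());
        }
        return rates;
    }

    /** Score of the best setting currently in the pool. */
    static double bestScore(double[] rates) {
        double best = Double.NEGATIVE_INFINITY;
        for (double r : rates) {
            best = Math.max(best, score(r));
        }
        return best;
    }

    /** One generation: the worst setting is replaced by a mutated copy of the best. */
    static void step(double[] rates, Random rnd) {
        int best = 0;
        int worst = 0;
        for (int i = 1; i < rates.length; i++) {
            if (score(rates[i]) > score(rates[best])) best = i;
            if (score(rates[i]) < score(rates[worst])) worst = i;
        }
        if (best != worst) {
            rates[worst] = rates[best] * Math.exp(0.3 * rnd.nextGaussian());
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(7L);
        double[] rates = initialPool(20, rnd);
        System.out.println("best score before: " + bestScore(rates));
        for (int g = 0; g < 50; g++) {
            step(rates, rnd);
        }
        System.out.println("best score after:  " + bestScore(rates));
    }
}
```

Because the best setting is never overwritten, the best score in the pool cannot get worse from one generation to the next.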

h2. Design of packages

Mahout's SGD system uses three packages:

* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)

* The SGD learning package (found in org.apache.mahout.classifier.sgd)

* The evolutionary optimization system (found in org.apache.mahout.ep)

h3. Feature vector encoding

Because the SGD algorithms need to have fixed length feature vectors and 
because it is a pain to build a dictionary ahead of time, most SGD applications 
use the hashed feature vector encoding system that is rooted at 
FeatureVectorEncoder.

The basic idea is that you create a vector, typically a 
RandomAccessSparseVector, and then you use various feature encoders to 
progressively add features to that vector.  The size of the vector should be 
large enough to avoid feature collisions as features are hashed.

There are specialized encoders for a variety of data types.  You can normally 
encode either a string representation of the value you want to encode or you 
can encode a byte level representation to avoid string conversion.  In the case 
of ContinuousValueEncoder and ConstantValueEncoder, it is also possible to 
encode a null value and pass the real value in as a weight.  This avoids 
numerical parsing entirely in case you are getting your training data from a 
system like Avro.
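
The hashing idea can be sketched in a few lines of plain Java. This does not use Mahout's encoder classes: the class HashedEncodingSketch, its method names and the use of String.hashCode are assumptions made for illustration, and a dense double[] stands in for a sparse vector.

```java
/** Plain-Java sketch of hashed feature encoding (not Mahout's encoder classes). */
class HashedEncodingSketch {
    /** Hashes a feature name plus an optional value to a slot in [0, size). */
    static int slot(String name, String value, int size) {
        String key = value == null ? name : name + ":" + value;
        return Math.floorMod(key.hashCode(), size);
    }

    /** Categorical feature: each distinct (name, value) pair hashes to a slot. */
    static void addCategorical(double[] v, String name, String value) {
        v[slot(name, value, v.length)] += 1.0;
    }

    /** Continuous feature: null value, the number itself is passed as the weight. */
    static void addContinuous(double[] v, String name, double weight) {
        v[slot(name, null, v.length)] += weight;
    }

    public static void main(String[] args) {
        double[] v = new double[1000]; // fixed length, chosen up front
        addCategorical(v, "country", "ES");
        addCategorical(v, "browser", "firefox");
        addContinuous(v, "amount", 12.5);
        System.out.println("slot for country:ES = " + slot("country", "ES", v.length));
    }
}
```

No dictionary is built at any point. The vector length alone decides how often two different features land in the same slot, which is why it should be chosen generously.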

Here is a class diagram for the encoders package:

!vector-class-hierarchy.png|border=1!

h3. SGD Learning

For the simplest applications, you can construct an OnlineLogisticRegression 
and be off and running.  Typically, though, it is nice to have running 
estimates of performance on held out data.  To do that, you should use a 
CrossFoldLearner which keeps a stable of five (by default) 
OnlineLogisticRegression objects.  Each time you pass a training example to a 
CrossFoldLearner, it passes this example to all but one of its children as 
training data and passes the example to the remaining child to evaluate current 
performance.  The children are used for evaluation in a round-robin fashion so, 
if you are using the default 5 way split, each child gets 80% of the training 
data for training and 20% of the data for evaluation.
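
The round-robin bookkeeping behind the 80%/20% split can be sketched in plain Java. This is not the CrossFoldLearner source: the class CrossFoldSketch and its fields are hypothetical, and it only counts examples where a real learner would train or evaluate.

```java
/** Plain-Java sketch of round-robin cross-fold bookkeeping (not Mahout's CrossFoldLearner). */
class CrossFoldSketch {
    final int[] trained;   // examples each child has trained on
    final int[] evaluated; // examples each child has been scored on
    private int record = 0;

    CrossFoldSketch(int folds) {
        trained = new int[folds];
        evaluated = new int[folds];
    }

    /** Routes one example: one child holds it out, the others train on it. */
    void observe() {
        int heldOut = record % trained.length;
        for (int k = 0; k < trained.length; k++) {
            if (k == heldOut) {
                evaluated[k]++; // a real learner would score the example here
            } else {
                trained[k]++;   // a real learner would take an SGD step here
            }
        }
        record++;
    }

    public static void main(String[] args) {
        CrossFoldSketch cf = new CrossFoldSketch(5);
        for (int i = 0; i < 100; i++) {
            cf.observe();
        }
        System.out.println(cf.trained[0] + " trained, " + cf.evaluated[0] + " held out");
    }
}
```

With 5 folds and 100 examples, every child ends up with 80 training examples and 20 held-out examples.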

To avoid the pesky need to configure learning rates, regularization parameters 
and annealing schedules, you can use the AdaptiveLogisticRegression.  This 
class maintains a pool of CrossFoldLearners and adapts learning rates and 
regularization on the fly so that you don't have to.

Here is a class diagram for the classifier.sgd package.  As you can see, the 
number of twiddlable knobs is pretty large.  For some examples, see the 
TrainNewsGroups example code.

!sgd-class-hierarchy.png|border=1!


