Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Logistic Regression 
(https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression)


Edited by Ted Dunning:
---------------------------------------------------------------------
h1. Logistic Regression

Logistic regression is a model used to predict the probability that an event 
occurs. It makes use of several predictor variables that may be either 
numerical or categorical.

Logistic regression is a standard industry workhorse that underlies many 
production fraud detection and advertising quality and targeting products.  The 
Mahout implementation uses Stochastic Gradient Descent (SGD) to allow large 
training sets to be used.
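
As a rough illustration of the idea, here is a minimal sketch of logistic regression trained by SGD in plain Java. It is not Mahout's implementation: the class name SgdLogisticSketch, the fixed learning rate of 0.1 and the toy data stream are all assumptions made for the example. Each training example is used for a single cheap weight update and then discarded, which is why the training set never has to fit in memory.

```java
import java.util.Random;

// Sketch only: logistic regression trained by SGD. Not Mahout's code;
// the class name, learning rate and toy data are assumptions.
class SgdLogisticSketch {
    final double[] weights;
    final double learningRate;

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Predicted probability that the event occurs (logistic function of w.x).
    double predict(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on a single example; label must be 0 or 1.
    void train(int label, double[] x) {
        double error = label - predict(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    // Trains on a toy stream: x = [bias, v] with v in [-1, 1), label is 1 when v > 0.
    static SgdLogisticSketch trainToy(long seed, int steps) {
        Random rnd = new Random(seed);
        SgdLogisticSketch model = new SgdLogisticSketch(2, 0.1);
        for (int t = 0; t < steps; t++) {
            double v = 2 * rnd.nextDouble() - 1;
            model.train(v > 0 ? 1 : 0, new double[] {1.0, v});
        }
        return model;
    }

    public static void main(String[] args) {
        SgdLogisticSketch model = trainToy(42L, 5000);
        System.out.println(model.predict(new double[] {1.0, 0.8}));
        System.out.println(model.predict(new double[] {1.0, -0.8}));
    }
}
```

After training, the first printed probability should be above 0.5 and the second below it.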

For a more detailed analysis of the approach, have a look at the thesis of Paul 
Komarek:

http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en


h2. Parallelization strategy

The bad news is that SGD is an inherently sequential algorithm.  The good news 
is that it is blazingly fast, so Mahout's implementation has no trouble 
handling training sets of tens of millions of examples.  With the down-sampling 
typical in many datasets, this is equivalent to a dataset with billions of raw 
training examples.

The SGD system in Mahout is an online learning algorithm, which means that you 
can learn models incrementally and test performance while your system runs.  
Often this means that you can stop training when a model reaches a target level 
of performance.  The SGD framework includes classes to do online evaluation 
using cross validation (the CrossFoldLearner) and an evolutionary system to 
optimize learning hyper-parameters on the fly (the 
AdaptiveLogisticRegression).  The AdaptiveLogisticRegression system makes heavy 
use of threads to increase machine utilization.  It runs 20 CrossFoldLearners 
in separate threads, each with slightly different learning parameters.  As 
better settings are found, these new settings are propagated to the other 
learners.
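
The pool-of-learners idea can be sketched in plain Java as follows. This is a simplified illustration and not Mahout's AdaptiveLogisticRegression: it is single-threaded, it evaluates each learner by scoring every example before training on it rather than by cross validation, and the names LearnerPoolSketch, Learner and run, the perturbation factor and the toy data are all assumptions.

```java
import java.util.Random;

// Sketch only: a pool of online learners with different learning rates.
// Not Mahout's AdaptiveLogisticRegression; all names here are hypothetical.
class LearnerPoolSketch {
    // One online logistic learner that scores each example before learning from it.
    static final class Learner {
        final double[] w;
        double rate;
        double lossSum = 0.0;
        int seen = 0;

        Learner(int n, double rate) {
            this.w = new double[n];
            this.rate = rate;
        }

        double predict(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-dot));
        }

        // Online evaluation: record the log loss first, then take one SGD step.
        void observe(int label, double[] x) {
            double p = predict(x);
            double clipped = Math.min(Math.max(p, 1e-9), 1 - 1e-9);
            lossSum -= (label == 1) ? Math.log(clipped) : Math.log(1 - clipped);
            seen++;
            for (int i = 0; i < w.length; i++) w[i] += rate * (label - p) * x[i];
        }

        double meanLoss() {
            return seen == 0 ? Double.POSITIVE_INFINITY : lossSum / seen;
        }
    }

    // Runs the pool on a toy stream; returns the best mean log loss of the last full epoch.
    static double run(long seed, int poolSize, int steps, int epoch) {
        Random rnd = new Random(seed);
        Learner[] pool = new Learner[poolSize];
        for (int k = 0; k < poolSize; k++) {
            // Learning rates start log-uniform between 1e-4 and 1.
            pool[k] = new Learner(2, Math.pow(10, -4 * rnd.nextDouble()));
        }
        double bestLoss = Double.POSITIVE_INFINITY;
        for (int t = 1; t <= steps; t++) {
            double v = 2 * rnd.nextDouble() - 1;
            double[] x = {1.0, v};
            int label = v > 0 ? 1 : 0;
            for (Learner l : pool) l.observe(label, x);
            if (t % epoch != 0) continue;
            // End of epoch: pick the learner with the lowest online loss.
            Learner best = pool[0];
            for (Learner l : pool) if (l.meanLoss() < best.meanLoss()) best = l;
            bestLoss = best.meanLoss();
            // Propagate the best settings, slightly perturbed, to the other learners.
            for (Learner l : pool) {
                if (l != best) {
                    l.rate = best.rate * Math.exp(0.3 * rnd.nextGaussian());
                    System.arraycopy(best.w, 0, l.w, 0, best.w.length);
                }
                l.lossSum = 0.0;
                l.seen = 0;
            }
        }
        return bestLoss;
    }

    public static void main(String[] args) {
        System.out.println(run(7L, 20, 2000, 200));
    }
}
```

Because every learner is scored on examples it has not yet trained on, the running loss can also serve as a stopping signal: training can end once the best learner's loss falls below a chosen target.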

h2. Design of packages

Mahout's SGD system uses three packages:

* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)

* The SGD learning package (found in org.apache.mahout.classifier.sgd)

* The evolutionary optimization system (found in org.apache.mahout.ep)

h3. Feature vector encoding

Because the SGD algorithms need fixed-length feature vectors and because it is 
a pain to build a dictionary ahead of time, most SGD applications use the 
hashed feature vector encoding system that is rooted at FeatureValueEncoder.
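
The general hashing trick behind this kind of encoding can be sketched in plain Java as below. This is not the FeatureValueEncoder API: the class HashedEncoderSketch, its two methods and the use of String.hashCode are assumptions chosen to keep the example self-contained, and Mahout's own encoders differ in detail. Each feature is hashed to a slot in a vector whose length is fixed up front, so no dictionary is needed.

```java
import java.util.Arrays;

// Sketch only: the hashing trick for fixed-length feature vectors.
// Not Mahout's FeatureValueEncoder; class and method names are hypothetical.
class HashedEncoderSketch {
    // Adds one categorical feature (e.g. "country" = "DE") to the vector in place.
    // The slot depends on both the feature name and its value.
    static void addCategorical(String name, String value, double[] vector) {
        int index = Math.floorMod((name + ":" + value).hashCode(), vector.length);
        vector[index] += 1.0;
    }

    // Adds one numeric feature; the slot depends on the name only,
    // and the numeric value is used as the weight.
    static void addNumeric(String name, double value, double[] vector) {
        int index = Math.floorMod(name.hashCode(), vector.length);
        vector[index] += value;
    }

    public static void main(String[] args) {
        double[] vector = new double[16];
        addCategorical("country", "DE", vector);
        addNumeric("age", 37.0, vector);
        System.out.println(Arrays.toString(vector));
    }
}
```

Two different features can land in the same slot. A larger vector makes such collisions rarer at the cost of more memory.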

