Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Logistic Regression (https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression)
Edited by Ted Dunning:
---------------------------------------------------------------------
h1. Logistic Regression

Logistic regression is a model used to predict the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. Logistic regression is a standard industry workhorse that underlies many production fraud detection and advertising quality and targeting products. The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow large training sets to be used.

For a more detailed analysis of the approach, have a look at the thesis of Paul Komarek:

http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en

h2. Parallelization strategy

The bad news is that SGD is an inherently sequential algorithm. The good news is that it is blazingly fast, so Mahout's implementation has no problem handling training sets of tens of millions of examples. With the down-sampling typical in many data sets, this is equivalent to a data set with billions of raw training examples.

The SGD system in Mahout is an online learning algorithm, which means that you can learn models incrementally and that you can do performance testing as your system runs. Often this means that you can stop training when a model reaches a target level of performance. The SGD framework includes classes to do online evaluation using cross validation (the CrossFoldLearner) and an evolutionary system to do learning hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). The AdaptiveLogisticRegression system makes heavy use of threads to increase machine utilization. It runs 20 CrossFoldLearners in separate threads, each with slightly different learning parameters. As better settings are found, these new settings are propagated to the other learners.

h2. Design of packages

There are three packages that are used in Mahout's SGD system:

* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)
* The SGD learning package (found in org.apache.mahout.classifier.sgd)
* The evolutionary optimization system (found in org.apache.mahout.ep)

h3. Feature vector encoding

Because the SGD algorithms need fixed-length feature vectors, and because it is a pain to build a dictionary ahead of time, most SGD applications use the hashed feature vector encoding system that is rooted at FeatureValueEncoder.
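
To make the idea concrete, here is a minimal, self-contained sketch of hashed feature encoding feeding an online SGD update for binary logistic regression. It does not use Mahout's classes and is not how Mahout implements them. The class name HashedSgdSketch, its methods (slot, encode, predict, train), the use of String.hashCode as the hash function, the fixed learning rate of 0.5, the vector size of 64, and the example field names are all assumptions made for illustration.

```java
// Illustrative sketch only; this is not Mahout's encoder or learner code.
final class HashedSgdSketch {
    private final double[] weights;     // one weight per hashed slot
    private final double learningRate;  // fixed step size (an assumption)

    HashedSgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Hash "name:value" into a slot, so no dictionary has to be built up front.
    int slot(String name, String value) {
        return Math.floorMod((name + ":" + value).hashCode(), weights.length);
    }

    // Encode categorical (name, value) pairs into a fixed-length vector.
    // Colliding pairs simply add up in the same slot.
    double[] encode(String[][] pairs) {
        double[] x = new double[weights.length];
        for (String[] pair : pairs) {
            x[slot(pair[0], pair[1])] += 1.0;
        }
        return x;
    }

    // Predicted probability of the positive class for an encoded example.
    double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < x.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One online SGD step on a single example with label 0 or 1.
    void train(int label, double[] x) {
        double error = label - predict(x);
        for (int i = 0; i < x.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        HashedSgdSketch model = new HashedSgdSketch(64, 0.5);
        double[] fraud = model.encode(new String[][] {{"country", "XX"}, {"card", "prepaid"}});
        double[] legit = model.encode(new String[][] {{"country", "US"}, {"card", "credit"}});
        // Examples are consumed one at a time, as in online learning.
        for (int pass = 0; pass < 100; pass++) {
            model.train(1, fraud);
            model.train(0, legit);
        }
        System.out.println("p(positive | fraud example) = " + model.predict(fraud));
        System.out.println("p(positive | legit example) = " + model.predict(legit));
    }
}
```

Because each example is processed on its own, the model can be evaluated at any point during training, which is what makes stopping at a target level of performance possible. A production encoder would normally use a stronger hash than String.hashCode and would handle numerical and text variables as well; those parts are left out here to keep the sketch short.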
