Random Forests (MAHOUT)
h3. How to grow a Decision Tree

source: \[3\]

LearnUnprunedTree(*X*, *Y*)

Input: *X*, a matrix of *R* rows and *M* columns, where *X*~ij~ = the value of the *j*'th attribute in the *i*'th input datapoint. Each column consists of either all real values or all categorical values.
Input: *Y*, a vector of *R* elements, where *Y*~i~ = the output class of the *i*'th datapoint. The *Y*~i~ values are categorical.
Output: an unpruned decision tree

If all records in *X* have identical values in all their attributes (this includes the case where *R* < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in *Y* are the same, return a Leaf Node predicting this value as the output.
Else
  select *m* variables at random out of the *M* variables
  For *j* = 1 .. *m*
    If the *j*'th attribute is categorical
      IG~j~ = IG(*Y*\|*X*~j~) (see Information Gain)
    Else (the *j*'th attribute is real-valued)
      IG~j~ = IG*(*Y*\|*X*~j~) (see Information Gain)
  Let *j\** = argmax~j~ IG~j~ (this is the splitting attribute we'll use)
  If *j\** is categorical then
    For each value *v* of the *j\**'th attribute
      Let *X*^v^ = the subset of rows of *X* in which *X*~ij*~ = *v*. Let *Y*^v^ = the corresponding subset of *Y*
      Let *Child*^v^ = LearnUnprunedTree(*X*^v^, *Y*^v^)
    Return a decision tree node, splitting on the *j\**'th attribute. The number of children equals the number of values of the *j\**'th attribute, and the *v*'th child is *Child*^v^
  Else *j\** is real-valued; let *t* be the best split threshold
    Let *X*^LO^ = the subset of rows of *X* in which *X*~ij*~ <= *t*. Let *Y*^LO^ = the corresponding subset of *Y*
    Let *Child*^LO^ = LearnUnprunedTree(*X*^LO^, *Y*^LO^)
    Let *X*^HI^ = the subset of rows of *X* in which *X*~ij*~ > *t*. Let *Y*^HI^ = the corresponding subset of *Y*
    Let *Child*^HI^ = LearnUnprunedTree(*X*^HI^, *Y*^HI^)
    Return a decision tree node, splitting on the *j\**'th attribute. It has two children corresponding to whether the *j\**'th attribute is above or below the given threshold.

*Note*: there are alternatives to Information Gain for splitting nodes.

h3. Information gain

source: \[3\]

h4. nominal attributes

Suppose X can take one of m values V~1~, V~2~, ..., V~m~, with
P(X = V~1~) = p~1~, P(X = V~2~) = p~2~, ..., P(X = V~m~) = p~m~

H(X) = - sum~j=1..m~ p~j~ log~2~ p~j~ (the entropy of X)
H(Y\|X = v) = the entropy of Y among only those records in which X has value v
H(Y\|X) = sum~j~ p~j~ H(Y\|X = v~j~)
IG(Y\|X) = H(Y) - H(Y\|X)

h4. real-valued attributes

Suppose X is real-valued.

define IG(Y\|X:t) as H(Y) - H(Y\|X:t)
define H(Y\|X:t) = H(Y\|X < t) P(X < t) + H(Y\|X >= t) P(X >= t)
define IG*(Y\|X) = max~t~ IG(Y\|X:t)
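The formulas above translate directly into code. What follows is a minimal sketch in Java (not Mahout's implementation; the class and method names are invented for illustration) that estimates H(X) from observed frequencies and computes IG(Y\|X) for a nominal attribute.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of entropy and information gain for a nominal attribute.
 * Not Mahout code; names are illustrative only.
 */
public class InformationGainExample {

  /** H = -sum_j p_j log2 p_j, with the p_j estimated from observed frequencies. */
  static double entropy(String[] values) {
    Map<String, Integer> counts = new HashMap<>();
    for (String v : values) {
      counts.merge(v, 1, Integer::sum);
    }
    double h = 0.0;
    for (int count : counts.values()) {
      double p = (double) count / values.length;
      h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v), with X and Y given row by row. */
  static double informationGain(String[] x, String[] y) {
    Map<String, Integer> counts = new HashMap<>();
    for (String v : x) {
      counts.merge(v, 1, Integer::sum);
    }
    double conditional = 0.0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // Collect the labels of the rows where X has value v,
      // then weight their entropy by P(X = v).
      String v = e.getKey();
      String[] subset = new String[e.getValue()];
      int k = 0;
      for (int i = 0; i < x.length; i++) {
        if (x[i].equals(v)) {
          subset[k++] = y[i];
        }
      }
      conditional += ((double) subset.length / x.length) * entropy(subset);
    }
    return entropy(y) - conditional;
  }

  public static void main(String[] args) {
    // Toy data: how much does "outlook" tell us about "play"?
    String[] outlook = {"sunny", "sunny", "overcast", "rain", "rain"};
    String[] play    = {"no",    "no",    "yes",      "yes",  "no"};
    System.out.println("H(play)            = " + entropy(play));
    System.out.println("IG(play | outlook) = " + informationGain(outlook, play));
  }
}
{code}

For a real-valued attribute the same gain is evaluated at candidate thresholds t, and the maximum over t is kept, as in the definition of IG*(Y\|X) above.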
h3. How to grow a Random Forest

source: \[1\]

Each tree is grown as follows:
# If the number of cases in the training set is *N*, sample *N* cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
# If there are *M* input variables, a number *m* << *M* is specified such that at each node, *m* variables are selected at random out of the *M*, and the best split on these *m* is used to split the node. The value of *m* is held constant while the forest is grown.
# Each tree is grown to the largest extent possible. There is no pruning.

h3. Random Forest parameters

source: \[2\]

Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (*m*) to be randomly selected from the available set of variables. Breiman's recommendation is to pick a large number of trees, and to use the square root of the number of variables for *m*.

h3. How to predict the label of a case

Classify(*node*, *V*)

Input: *node*, a node of the decision tree; if *node.attribute* = *j* then the split is done on the *j*'th attribute
Input: *V*, a vector of *M* columns where *V*~j~ = the value of the *j*'th attribute
Output: the label of *V*

If *node* is a Leaf then
  Return the value predicted by *node*
Else
  Let *j* = *node.attribute*
  If the *j*'th attribute is categorical then
    Let *v* = *V*~j~
    Let *child*^v^ = the child node corresponding to the attribute's value *v*
    Return Classify(*child*^v^, *V*)
  Else the *j*'th attribute is real-valued
    Let *t* = *node.threshold* (split threshold)
    If *V*~j~ < *t* then
      Let *child*^LO^ = the child node corresponding to (< *t*)
      Return Classify(*child*^LO^, *V*)
    Else
      Let *child*^HI^ = the child node corresponding to (>= *t*)
      Return Classify(*child*^HI^, *V*)

To classify a case with the whole forest, run it down every tree and take the class that receives the most votes.

h3. The out of bag (oob) error estimation

source: \[1\]

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
* each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the _kth_ tree.
* put each case left out of the construction of the _kth_ tree down the _kth_ tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take *j* to be the class that got most of the votes every time case *n* was _oob_. The proportion of times that *j* is not equal to the true class of *n*, averaged over all cases, is the _oob error estimate_. This has proven to be unbiased in many tests.
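To make the bookkeeping concrete, here is a minimal Java sketch of the oob estimate described above. It is not Mahout's implementation: the Tree and TreeLearner interfaces, the class name and the stub learner in main are invented for illustration; only the bootstrap sampling and the oob vote counting follow the description above.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Sketch of oob error estimation: each tree sees a bootstrap sample of the N cases,
 * and each case is later classified only by the trees whose sample left it out.
 */
public class OobErrorSketch {

  /** Placeholder for a grown tree: maps a training-case index to a predicted label. */
  interface Tree {
    String classify(int caseIndex);
  }

  /** Hypothetical hook for the tree learner, e.g. a LearnUnprunedTree-style builder. */
  interface TreeLearner {
    Tree learn(int[] bootstrapIndices);
  }

  static double oobError(int numTrees, String[] trueLabels, TreeLearner learner, Random rng) {
    int numCases = trueLabels.length;

    // votes.get(n) counts, per class, the votes of the trees for which case n was oob.
    List<Map<String, Integer>> votes = new ArrayList<>();
    for (int n = 0; n < numCases; n++) {
      votes.add(new HashMap<>());
    }

    for (int k = 0; k < numTrees; k++) {
      // Bootstrap sample: N draws with replacement; about one third of the cases are left out.
      boolean[] inBag = new boolean[numCases];
      int[] sample = new int[numCases];
      for (int i = 0; i < numCases; i++) {
        sample[i] = rng.nextInt(numCases);
        inBag[sample[i]] = true;
      }
      Tree tree = learner.learn(sample);

      // Run every left-out case down the k'th tree and record its vote.
      for (int n = 0; n < numCases; n++) {
        if (!inBag[n]) {
          votes.get(n).merge(tree.classify(n), 1, Integer::sum);
        }
      }
    }

    // oob error = proportion of cases whose oob majority vote differs from the true class.
    int wrong = 0;
    int counted = 0;
    for (int n = 0; n < numCases; n++) {
      Map<String, Integer> caseVotes = votes.get(n);
      if (caseVotes.isEmpty()) {
        continue; // this case was in-bag for every tree, so it has no oob prediction
      }
      String majority = null;
      for (Map.Entry<String, Integer> e : caseVotes.entrySet()) {
        if (majority == null || e.getValue() > caseVotes.get(majority)) {
          majority = e.getKey();
        }
      }
      counted++;
      if (!majority.equals(trueLabels[n])) {
        wrong++;
      }
    }
    return counted == 0 ? 0.0 : (double) wrong / counted;
  }

  public static void main(String[] args) {
    // Toy usage with a stub "learner" that ignores the sample and always predicts "yes";
    // a real learner would grow an unpruned tree from the bootstrap sample.
    String[] labels = {"yes", "yes", "no", "yes", "no", "yes", "no", "yes"};
    TreeLearner stub = bootstrapIndices -> caseIndex -> "yes";
    System.out.println("oob error = " + oobError(50, labels, stub, new Random(42)));
  }
}
{code}

In a real forest, the learner plugged in here would be the LearnUnprunedTree procedure above, restricted at each node to *m* randomly selected attributes.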
h3. Other RF uses

source: \[1\]

* variable importance
* gini importance
* proximities
* scaling
* prototypes
* missing values replacement for the training set
* missing values replacement for the test set
* detecting mislabeled cases
* detecting outliers
* detecting novelties
* unsupervised learning
* balancing prediction error

Please refer to \[1\] for a detailed description.

h3. References

\[1\] Random Forests - Classification Description. [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282, Ghent University, Faculty of Economics and Business Administration. Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html]
\[3\] Decision Trees - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials]
\[4\] Information Gain - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials]