Random Forests (MAHOUT)
h3. How to grow a Decision Tree

source: \[3\]

LearnUnprunedTree(*X*, *Y*)

Input: *X*, a matrix of *R* rows and *M* columns, where *X*~ij~ = the value of the *j*'th attribute in the *i*'th input datapoint. Each column consists of either all real values or all categorical values.
Input: *Y*, a vector of *R* elements, where *Y*~i~ = the output class of the *i*'th datapoint. The *Y*~i~ values are categorical.
Output: an unpruned decision tree

If all records in *X* have identical values in all their attributes (this includes the case where *R* < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in *Y* are the same, return a Leaf Node predicting this value as the output.
Else
  select *m* variables at random out of the *M* variables
  For *j* = 1 .. *m*
    If the *j*'th attribute is categorical
      IG~j~ = IG(*Y*\|*X*~j~) (see Information Gain)
    Else (the *j*'th attribute is real-valued)
      IG~j~ = IG*(*Y*\|*X*~j~) (see Information Gain)
  Let *j\** = argmax~j~ IG~j~ (this is the splitting attribute we'll use)
  If *j\** is categorical then
    For each value *v* of the *j\**'th attribute
      Let *X*^v^ = the subset of rows of *X* in which *X*~ij*~ = *v*. Let *Y*^v^ = the corresponding subset of *Y*
      Let *Child*^v^ = LearnUnprunedTree(*X*^v^, *Y*^v^)
    Return a decision tree node, splitting on the *j\**'th attribute. The number of children equals the number of values of the *j\**'th attribute, and the *v*'th child is *Child*^v^
  Else *j\** is real-valued; let *t* be the best split threshold
    Let *X*^LO^ = the subset of rows of *X* in which *X*~ij*~ <= *t*. Let *Y*^LO^ = the corresponding subset of *Y*
    Let *Child*^LO^ = LearnUnprunedTree(*X*^LO^, *Y*^LO^)
    Let *X*^HI^ = the subset of rows of *X* in which *X*~ij*~ > *t*. Let *Y*^HI^ = the corresponding subset of *Y*
    Let *Child*^HI^ = LearnUnprunedTree(*X*^HI^, *Y*^HI^)
    Return a decision tree node, splitting on the *j\**'th attribute. It has two children corresponding to whether the *j\**'th attribute is above or below the given threshold.

*Note*: there are alternatives to Information Gain for splitting nodes.

h3. Information gain

source: \[3\]

h4. nominal attributes

Suppose X can take one of m values V~1~, V~2~, ..., V~m~, with
P(X = V~1~) = p~1~, P(X = V~2~) = p~2~, ..., P(X = V~m~) = p~m~

H(X) = - sum~j=1..m~ p~j~ log~2~ p~j~ (the entropy of X)
H(Y\|X = v) = the entropy of Y among only those records in which X has value v
H(Y\|X) = sum~j~ p~j~ H(Y\|X = v~j~)
IG(Y\|X) = H(Y) - H(Y\|X)

h4. real-valued attributes

Suppose X is real-valued.

define IG(Y\|X:t) as H(Y) - H(Y\|X:t)
define H(Y\|X:t) = H(Y\|X < t) P(X < t) + H(Y\|X >= t) P(X >= t)
define IG*(Y\|X) = max~t~ IG(Y\|X:t)
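The formulas above translate directly into code. What follows is a minimal sketch in Java (not Mahout's implementation; the class and method names are invented for illustration) that estimates H(X) from observed frequencies and computes IG(Y\|X) for a nominal attribute.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of entropy and information gain for a nominal attribute.
 * Not Mahout code; names are illustrative only.
 */
public class InformationGainExample {

  /** H = -sum_j p_j log2 p_j, with the p_j estimated from observed frequencies. */
  static double entropy(String[] values) {
    Map<String, Integer> counts = new HashMap<>();
    for (String v : values) {
      counts.merge(v, 1, Integer::sum);
    }
    double h = 0.0;
    for (int count : counts.values()) {
      double p = (double) count / values.length;
      h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v), with X and Y given row by row. */
  static double informationGain(String[] x, String[] y) {
    Map<String, Integer> counts = new HashMap<>();
    for (String v : x) {
      counts.merge(v, 1, Integer::sum);
    }
    double conditional = 0.0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // Collect the labels of the rows where X has value v,
      // then weight their entropy by P(X = v).
      String v = e.getKey();
      String[] subset = new String[e.getValue()];
      int k = 0;
      for (int i = 0; i < x.length; i++) {
        if (x[i].equals(v)) {
          subset[k++] = y[i];
        }
      }
      conditional += ((double) subset.length / x.length) * entropy(subset);
    }
    return entropy(y) - conditional;
  }

  public static void main(String[] args) {
    // Toy data: how much does "outlook" tell us about "play"?
    String[] outlook = {"sunny", "sunny", "overcast", "rain", "rain"};
    String[] play    = {"no",    "no",    "yes",      "yes",  "no"};
    System.out.println("H(play)            = " + entropy(play));
    System.out.println("IG(play | outlook) = " + informationGain(outlook, play));
  }
}
{code}

For a real-valued attribute the same gain is evaluated at candidate thresholds t, and the maximum over t is kept, as in the definition of IG*(Y\|X) above.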
h3. How to grow a Random Forest

source: \[1\]

Each tree is grown as follows:
# If the number of cases in the training set is *N*, sample *N* cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
# If there are *M* input variables, a number *m* << *M* is specified such that at each node, *m* variables are selected at random out of the *M*, and the best split on these *m* is used to split the node. The value of *m* is held constant while the forest is grown.
# Each tree is grown to the largest extent possible. There is no pruning.

h3. Random Forest parameters

source: \[2\]

Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (*m*) to be randomly selected from the available set of variables. Breiman's recommendation is to pick a large number of trees, and to use the square root of the number of variables for *m*.

h3. How to predict the label of a case

Classify(*node*, *V*)

Input: *node*, a node of the decision tree; if *node.attribute* = *j* then the split is done on the *j*'th attribute
Input: *V*, a vector of *M* columns where *V*~j~ = the value of the *j*'th attribute
Output: the label of *V*

If *node* is a Leaf then
  Return the value predicted by *node*
Else
  Let *j* = *node.attribute*
  If the *j*'th attribute is categorical then
    Let *v* = *V*~j~
    Let *child*^v^ = the child node corresponding to the attribute's value *v*
    Return Classify(*child*^v^, *V*)
  Else the *j*'th attribute is real-valued
    Let *t* = *node.threshold* (split threshold)
    If *V*~j~ < *t* then
      Let *child*^LO^ = the child node corresponding to (< *t*)
      Return Classify(*child*^LO^, *V*)
    Else
      Let *child*^HI^ = the child node corresponding to (>= *t*)
      Return Classify(*child*^HI^, *V*)

To classify a case with the whole forest, run it down every tree and take the class that receives the most votes.

h3. The out of bag (oob) error estimation

source: \[1\]

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
* each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the _kth_ tree.
* put each case left out of the construction of the _kth_ tree down the _kth_ tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take *j* to be the class that got most of the votes every time case *n* was _oob_. The proportion of times that *j* is not equal to the true class of *n*, averaged over all cases, is the _oob error estimate_. This has proven to be unbiased in many tests.
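To make the bookkeeping concrete, here is a minimal Java sketch of the oob estimate described above. It is not Mahout's implementation: the Tree and TreeLearner interfaces, the class name and the stub learner in main are invented for illustration; only the bootstrap sampling and the oob vote counting follow the description above.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Sketch of oob error estimation: each tree sees a bootstrap sample of the N cases,
 * and each case is later classified only by the trees whose sample left it out.
 */
public class OobErrorSketch {

  /** Placeholder for a grown tree: maps a training-case index to a predicted label. */
  interface Tree {
    String classify(int caseIndex);
  }

  /** Hypothetical hook for the tree learner, e.g. a LearnUnprunedTree-style builder. */
  interface TreeLearner {
    Tree learn(int[] bootstrapIndices);
  }

  static double oobError(int numTrees, String[] trueLabels, TreeLearner learner, Random rng) {
    int numCases = trueLabels.length;

    // votes.get(n) counts, per class, the votes of the trees for which case n was oob.
    List<Map<String, Integer>> votes = new ArrayList<>();
    for (int n = 0; n < numCases; n++) {
      votes.add(new HashMap<>());
    }

    for (int k = 0; k < numTrees; k++) {
      // Bootstrap sample: N draws with replacement; about one third of the cases are left out.
      boolean[] inBag = new boolean[numCases];
      int[] sample = new int[numCases];
      for (int i = 0; i < numCases; i++) {
        sample[i] = rng.nextInt(numCases);
        inBag[sample[i]] = true;
      }
      Tree tree = learner.learn(sample);

      // Run every left-out case down the k'th tree and record its vote.
      for (int n = 0; n < numCases; n++) {
        if (!inBag[n]) {
          votes.get(n).merge(tree.classify(n), 1, Integer::sum);
        }
      }
    }

    // oob error = proportion of cases whose oob majority vote differs from the true class.
    int wrong = 0;
    int counted = 0;
    for (int n = 0; n < numCases; n++) {
      Map<String, Integer> caseVotes = votes.get(n);
      if (caseVotes.isEmpty()) {
        continue; // this case was in-bag for every tree, so it has no oob prediction
      }
      String majority = null;
      for (Map.Entry<String, Integer> e : caseVotes.entrySet()) {
        if (majority == null || e.getValue() > caseVotes.get(majority)) {
          majority = e.getKey();
        }
      }
      counted++;
      if (!majority.equals(trueLabels[n])) {
        wrong++;
      }
    }
    return counted == 0 ? 0.0 : (double) wrong / counted;
  }

  public static void main(String[] args) {
    // Toy usage with a stub "learner" that ignores the sample and always predicts "yes";
    // a real learner would grow an unpruned tree from the bootstrap sample.
    String[] labels = {"yes", "yes", "no", "yes", "no", "yes", "no", "yes"};
    TreeLearner stub = bootstrapIndices -> caseIndex -> "yes";
    System.out.println("oob error = " + oobError(50, labels, stub, new Random(42)));
  }
}
{code}

In a real forest, the learner plugged in here would be the LearnUnprunedTree procedure above, restricted at each node to *m* randomly selected attributes.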
h3. Other RF uses

source: \[1\]

* variable importance
* gini importance
* proximities
* scaling
* prototypes
* missing values replacement for the training set
* missing values replacement for the test set
* detecting mislabeled cases
* detecting outliers
* detecting novelties
* unsupervised learning
* balancing prediction error

Please refer to \[1\] for a detailed description.

h3. References

\[1\] Random Forests - Classification Description. [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]
\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282, Ghent University, Faculty of Economics and Business Administration. Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html]
\[3\] Decision Trees - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials]
\[4\] Information Gain - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials]