[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156 ]
Deneche A. Hakim commented on MAHOUT-245:
-----------------------------------------

I modified the code to not select Categorical attributes that have been selected in one of the parent nodes. I also modified the BreimanExample to show the mean number of nodes (averaged over all iterations of the example) in all the trees of the built forests.

I tested on two UCI datasets:
* [glass identification dataset | http://archive.ics.uci.edu/ml/datasets/Glass+Identification]: This dataset contains only numerical attributes, hence it should not be affected by the modification. The test runs 100 iterations, each building 100 trees.
* [poker hand (training) dataset | http://archive.ics.uci.edu/ml/datasets/Poker+Hand]: This dataset contains 10 categorical attributes. The test runs 10 iterations, each building 100 trees.

The results are (Before the modification):
|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 25.2% | 25.6% | 40.1% | 1s 27ms | 0s 497ms | 6715 | 11419 |
| poker | 27.5% | 37.8% | 44.2% | 1m 14s 855ms | 58s 200ms | 1442811 | 2133194 |

The results are (After the modification):
|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 22.5% | 22.8% | 39.8% | 0s 935ms | 0s 442ms | 6735 | 11528 |
| poker | 27.8% | 38.0% | 42.9% | 53s 24ms | 36s 818ms | 1372914 | 1700049 |

The Breiman Example and the meaning of the columns are described [here | http://issues.apache.org/jira/browse/MAHOUT-122?focusedCommentId=12718777&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12718777]. The two new columns are:
* Mean RI num nodes: mean number of nodes in the forest built using Random Selection
* Mean SI num nodes: mean number of nodes in the forest built using Single-Input Selection

The variations in the error rates are due (I hope) to the randomness in the process.
The build times are only relative (note that I'm running Ubuntu inside a VirtualBox), but we can see that the modification effectively reduces the number of nodes for the "poker" dataset.

> Better handling of Categorical attributes when building Decision Forests
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-245
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-245
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>         Attachments: mahout-245.patch
>
>
> When building a decision tree, at each node a random subset from all
> variables (attributes) is considered for the node split.
> If a Categorical variable has been selected, the data available at the node
> is split such that all the data in each child node has the same value for the
> selected variable. In all sub-nodes the selected variable should not be
> selected again, but the current implementation does not account for that. The
> resulting tree may contain redundant nodes that do not impair its
> classification performance but are nonetheless unnecessary.
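The selection rule described in the issue can be sketched roughly as follows. This is a minimal illustration only, not the code from mahout-245.patch: the class `AttributeSelectionSketch`, its methods, and the boolean-mask representation are all hypothetical names and choices made for this example. The sketch assumes each node receives a mask of attributes already used by its ancestors, draws its random candidates only from the unmasked attributes, and masks an attribute for its children only when that attribute is categorical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the selection rule; not Mahout's actual classes. */
final class AttributeSelectionSketch {

  /**
   * Picks up to m distinct attribute indices at random among the attributes
   * that are still eligible. excluded[i] is true when categorical attribute i
   * was already used for a split in one of the ancestor nodes.
   */
  static int[] randomAttributes(Random rng, boolean[] excluded, int m) {
    List<Integer> eligible = new ArrayList<>();
    for (int i = 0; i < excluded.length; i++) {
      if (!excluded[i]) {
        eligible.add(i);
      }
    }
    int count = Math.min(m, eligible.size());
    int[] result = new int[count];
    // Partial Fisher-Yates shuffle: only the first 'count' positions are drawn.
    for (int i = 0; i < count; i++) {
      int j = i + rng.nextInt(eligible.size() - i);
      Collections.swap(eligible, i, j);
      result[i] = eligible.get(i);
    }
    return result;
  }

  /**
   * Returns the mask to hand to the child nodes after splitting on 'attr'.
   * Only a categorical attribute is excluded; the parent's mask is left untouched
   * so that sibling subtrees do not affect each other.
   */
  static boolean[] maskForChildren(boolean[] excluded, boolean[] categorical, int attr) {
    boolean[] child = excluded.clone();
    if (categorical[attr]) {
      child[attr] = true;
    }
    return child;
  }
}
```

Numerical attributes are deliberately left eligible in this sketch, since a numerical attribute can be split again at a different threshold further down the tree. This matches the expectation in the comment above that the all-numerical glass dataset should not be affected by the change.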