[ 
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156
 ] 

Deneche A. Hakim commented on MAHOUT-245:
-----------------------------------------

I modified the code so that it no longer selects categorical attributes that 
have already been selected in one of the parent nodes.
I also modified the BreimanExample to show the mean number of nodes (averaged 
over all iterations of the example) in all the trees of the built forests.
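
As a rough illustration of the first change, here is a minimal sketch of the 
selection rule. It is not the code from mahout-245.patch: the class name, the 
method name and the two boolean arrays are hypothetical and only stand in for 
whatever bookkeeping the real tree builder uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Mahout's actual implementation.
class AttributeSelectionSketch {

    /**
     * Picks up to m candidate attributes for a node split, skipping every
     * categorical attribute that an ancestor node has already split on.
     *
     * @param categorical    categorical[i] is true if attribute i is categorical
     * @param usedByAncestor usedByAncestor[i] is true if attribute i was selected
     *                       on the path from the root to the current node
     */
    static List<Integer> selectCandidates(Random rng, boolean[] categorical,
                                          boolean[] usedByAncestor, int m) {
        List<Integer> eligible = new ArrayList<>();
        for (int attr = 0; attr < categorical.length; attr++) {
            // Only categorical attributes are excluded; numerical attributes
            // remain eligible even if an ancestor already split on them.
            if (!(categorical[attr] && usedByAncestor[attr])) {
                eligible.add(attr);
            }
        }
        // Random subset: shuffle the eligible attributes and keep the first m.
        Collections.shuffle(eligible, rng);
        return new ArrayList<>(eligible.subList(0, Math.min(m, eligible.size())));
    }

    public static void main(String[] args) {
        boolean[] categorical = {true, true, false, true};
        boolean[] usedByAncestor = {true, false, true, false};
        // Attribute 0 (categorical, already used) can never be picked;
        // attribute 2 (numerical, already used) still can.
        System.out.println(selectCandidates(new Random(42), categorical, usedByAncestor, 2));
    }
}
```

If fewer than m attributes remain eligible, the sketch simply returns all of 
them; how the real code handles that case is not covered here.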

I tested on two UCI datasets:
* [glass identification dataset | http://archive.ics.uci.edu/ml/datasets/Glass+Identification]: 
This dataset contains only numerical attributes, hence it should not be 
affected by the modification. The test runs 100 iterations, each building 100 
trees.
* [poker hand (training) dataset | http://archive.ics.uci.edu/ml/datasets/Poker+Hand]: 
This dataset contains 10 categorical attributes. The test runs 10 iterations, 
each building 100 trees.

The results are (Before the modification):

|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 25.2% | 25.6% | 40.1% | 1s 27ms | 0s 497ms | 6715 | 11419 |
| poker | 27.5% | 37.8% | 44.2% | 1m 14s 855ms | 58s 200ms | 1442811 | 2133194 |

The results are (After the modification):

|| Dataset || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time || Mean RI num nodes || Mean SI num nodes ||
| glass | 22.5% | 22.8% | 39.8% | 0s 935ms | 0s 442ms | 6735 | 11528 |
| poker | 27.8% | 38.0% | 42.9% | 53s 24ms | 36s 818ms | 1372914 | 1700049 |

The BreimanExample and the meaning of the columns are described 
[here | http://issues.apache.org/jira/browse/MAHOUT-122?focusedCommentId=12718777&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12718777].
* Mean RI num nodes: mean number of nodes in the forest built using Random 
Selection
* Mean SI num nodes: mean number of nodes in the forest built using 
Single-Input Selection

The variations in the error rates are due (I hope) to the randomness in the 
process. The build times are only indicative (note that I'm running Ubuntu 
inside a VirtualBox), but we can see that the modification effectively reduces 
the number of nodes for the "poker" dataset.
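
To put a number on that reduction, the relative change in the mean node counts 
can be computed directly from the two tables above. The helper below is only a 
sketch written for this comment (the class and method names are made up); the 
figures are copied from the tables.

```java
import java.util.Locale;

// Hypothetical helper: relative change in mean node count, in percent.
class NodeReductionSketch {

    /** Positive result = fewer nodes after the modification. */
    static double relativeReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Mean node counts taken from the "before" and "after" tables.
        System.out.printf(Locale.ROOT, "glass RI: %.1f%%%n", relativeReduction(6715L, 6735L));
        System.out.printf(Locale.ROOT, "glass SI: %.1f%%%n", relativeReduction(11419L, 11528L));
        System.out.printf(Locale.ROOT, "poker RI: %.1f%%%n", relativeReduction(1442811L, 1372914L));
        System.out.printf(Locale.ROOT, "poker SI: %.1f%%%n", relativeReduction(2133194L, 1700049L));
    }
}
```

For "poker" this works out to roughly 4.8% fewer nodes with Random Selection 
and roughly 20.3% fewer with Single-Input Selection, while the "glass" counts 
move by about 1% or less in the other direction.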


> Better handling of Categorical attributes when building Decision Forests
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-245
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-245
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>         Attachments: mahout-245.patch
>
>
> When building a decision tree, at each node a random subset from all 
> variables (attributes) is considered for the node split.
> If a Categorical variable has been selected, the data available at the node 
> is split such that each child node has the same value for the selected 
> variable. In all sub-nodes the selected variable should not be selected 
> again, but the current implementation does not account for that. The 
> resulting tree may contain redundant nodes that do not impair its 
> classification performance but are nonetheless unnecessary.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
