[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718777#action_12718777
 ] 

Deneche A. Hakim commented on MAHOUT-122:
-----------------------------------------

I ran some tests on several of the datasets used in Breiman's paper to compare 
the results of the reference implementation with those reported in the paper.

The test procedure described in Breiman's paper is as follows:
* 10% of the dataset is set aside as a test set
* for each dataset we build two forests, one with m=int(log2(M)+1) (called 
Random-Input) and one with m=1 (called Single-Input), where M is the number of 
input attributes and m is the number of attributes selected at random at each node
* we use the forest that gave the lowest oob error estimate to compute the 
test set error
* we also compute the test set error using the Single-Input forest, to show that 
even with m=1 Random Forests give results comparable to those obtained with 
greater values of m
* we compute the mean test set error over the individual trees of the forest that 
gave the lowest oob error. This shows how a single Decision Tree performs
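
The selection steps above can be sketched in a few lines of Python. This is only an illustration of the procedure, not code from the patch: the function names are hypothetical, the tie-breaking rule (ties go to Random-Input) is my own assumption, and the attribute count used in the usage line is just an example value.

```python
import math


def random_input_m(num_attributes):
    """Attributes tried at each node for the Random-Input forest: int(log2(M) + 1)."""
    return int(math.log2(num_attributes) + 1)


def select_forest(oob_random_input, oob_single_input):
    """Name of the forest with the lower oob error estimate.

    Ties go to Random-Input; this is an assumption, the procedure
    above does not say how ties are handled.
    """
    if oob_random_input <= oob_single_input:
        return "Random-Input"
    return "Single-Input"


def mean_tree_error(tree_errors):
    """Mean test set error over the individual trees of a forest (the One Tree value)."""
    return sum(tree_errors) / len(tree_errors)


# Example: with M = 60 input attributes, Random-Input uses m = 6
m = random_input_m(60)
```

With M = 60, log2(M) is about 5.9, so m = int(6.9) = 6; the Single-Input forest always uses m = 1.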

In the following tables:
* Selection: test set result with the forest that gave the lowest oob error
* Single Input: test set result with the Single-Input forest
* One Tree: Mean Tree test set error

*Breiman's results:*
|| Data || Selection || Single Input || One Tree ||
| glass | 20.6 | 21.2 | 36.9 | 
| breast cancer | 2.9 | 2.7 | 6.3 | 
| diabetes | 24.2 | 24.3 | 33.1 | 
| sonar | 15.9 | 18.0 | 31.7 | 
| ionosphere | 7.1 | 7.5 | 12.7 | 
| vehicle | 25.8 | 26.4 | 33.1 | 
| german | 24.4 | 26.2 | 33.3 | 

*Reference Implementation results:*
I also included the mean system time each forest (Random-Input or 
Single-Input) took to build.

|| Data || Selection || Single Input || One Tree || Mean RI Time || Mean SI Time ||
| glass | 24.8 | 23.9 | 41.2 | 9s  19 | 2s 667 |
| breast cancer | 2.8 | 2.7 | 5.8 | 2s 588 | 1s  60 |
| diabetes | 24.5 | 24.6 | 32.1 | 34s 875 | 10s 284 |
| sonar | 14.6 | 15.3 | 32.3 | 10s  89 | 2s 227 |
| ionosphere | 7.1 | 7.0 | 15.5 | 33s 190 | 6s  96 |
| vehicle | 25.3 | 26.4 | 33.7 | 42s 194 | 10s  21 |
| german | 23.15 | 25.27 | 32.8 | 10s 203 | 3s 654 |
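
To read the two tables side by side, the following sketch computes, for each dataset, the reference implementation value minus Breiman's value in each error column. The numbers are copied from the tables above; the dictionary and function names are hypothetical, and the timing columns are left out because they have no counterpart in Breiman's table.

```python
# (Selection, Single Input, One Tree) test set errors, copied from the tables above
breiman = {
    "glass": (20.6, 21.2, 36.9),
    "breast cancer": (2.9, 2.7, 6.3),
    "diabetes": (24.2, 24.3, 33.1),
    "sonar": (15.9, 18.0, 31.7),
    "ionosphere": (7.1, 7.5, 12.7),
    "vehicle": (25.8, 26.4, 33.1),
    "german": (24.4, 26.2, 33.3),
}
reference = {
    "glass": (24.8, 23.9, 41.2),
    "breast cancer": (2.8, 2.7, 5.8),
    "diabetes": (24.5, 24.6, 32.1),
    "sonar": (14.6, 15.3, 32.3),
    "ionosphere": (7.1, 7.0, 15.5),
    "vehicle": (25.3, 26.4, 33.7),
    "german": (23.15, 25.27, 32.8),
}


def differences(paper, ours):
    """Per-dataset difference (ours - paper) for each of the three error columns."""
    return {
        name: tuple(round(o - p, 2) for o, p in zip(ours[name], paper[name]))
        for name in paper
    }


diffs = differences(breiman, reference)
for name, (selection, single_input, one_tree) in diffs.items():
    print(f"{name:15s} {selection:+6.2f} {single_input:+6.2f} {one_tree:+6.2f}")
```

A positive value means the reference implementation had a higher test set error than the paper on that dataset, a negative value a lower one.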

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
