[
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-122:
------------------------------------
Attachment: refimp_Jul7.diff
I ran some tests on the "poker hand" dataset from UCI, which contains 8
categorical attributes and 1,000,000 instances. I got the following results (50
trees):
|| Ratio || Default || Optimized ||
| 100% | 11m 31s 253 | 8m 32s 446 |
It seems that the default implementation is fast enough for categorical
attributes, although the optimized version is still faster.
I also found the issue with the oob error estimation. The old code was:
{code}
Data bag = data.bagging(rng);
Node tree = treeBuilder.build(bag);

// predict the label for the out-of-bag elements
for (int index = 0; index < data.size(); index++) {
  Instance v = data.get(index);
  if (!bag.contains(v)) {
    int prediction = tree.classify(v);
    callback.prediction(treeId, v, prediction);
  }
}
{code}
The problem was with bag.contains(): commenting out this test drops the build
time from *21m 8s 473* to *5s 913*. I modified Data.bagging() to fill a given
boolean array marking which instances are sampled in the bag, and used it as follows:
{code}
Arrays.fill(sampled, false);
Data bag = data.bagging(rng, sampled);
Node tree = treeBuilder.build(bag);

// predict the label for the out-of-bag elements
for (int index = 0; index < data.size(); index++) {
  Instance v = data.get(index);
  if (!sampled[index]) {
    int prediction = tree.classify(v);
    callback.prediction(treeId, v, prediction);
  }
}
{code}
The new build time is *6s 777*. I think this issue is solved (for now...)
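To illustrate why the boolean array helps, here is a minimal standalone sketch. It is not the code from the patch: the class BaggingSketch and its methods are hypothetical stand-ins that work on plain indices instead of Mahout's Data and Instance classes. It also assumes that bag.contains() did a linear scan over the bag, which the timings suggest but which is not confirmed above. Under that assumption the old test costs O(n) per instance (O(n^2) per tree), while the mask lookup costs O(1) per instance.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of bootstrap sampling with an in-bag mask; not Mahout's Data class. */
final class BaggingSketch {

  /**
   * Draws n indices with replacement from [0, n) and marks every drawn index
   * in the sampled array. Returns the drawn indices (the "bag").
   */
  static int[] bagging(int n, Random rng, boolean[] sampled) {
    Arrays.fill(sampled, false);
    int[] bag = new int[n];
    for (int i = 0; i < n; i++) {
      int index = rng.nextInt(n);
      bag[i] = index;
      sampled[index] = true;
    }
    return bag;
  }

  /** Counts out-of-bag instances with one O(1) array lookup per instance. */
  static int countOutOfBag(boolean[] sampled) {
    int count = 0;
    for (boolean inBag : sampled) {
      if (!inBag) {
        count++;
      }
    }
    return count;
  }

  /** Same count, but with a linear scan of the bag per instance (assumed contains() behaviour). */
  static int countOutOfBagSlow(int[] bag, int n) {
    int count = 0;
    for (int index = 0; index < n; index++) {
      boolean found = false;
      for (int b : bag) {
        if (b == index) {
          found = true;
          break;
        }
      }
      if (!found) {
        count++;
      }
    }
    return count;
  }
}
```

Both counting methods return the same number of out-of-bag instances; only the cost per lookup differs. Reusing one boolean array across trees and clearing it for each new bag, as in the sketch and in the snippet above, avoids a fresh allocation per tree.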
> Random Forests Reference Implementation
> ---------------------------------------
>
> Key: MAHOUT-122
> URL: https://issues.apache.org/jira/browse/MAHOUT-122
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Attachments: 2w_patch.diff, 3w_patch.diff, refimp_Jul6.diff,
> refimp_Jul7.diff, RF reference.patch
>
>
> This is the first step of my GSOC project. Implement a simple, easy to
> understand, reference implementation of Random Forests (Building and
> Classification). The only requirement here is that "it works"
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.