[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910112#comment-13910112 ]
Sean Owen commented on MAHOUT-1419:
-----------------------------------
Hmm. The example I was looking at only used 100 trees. On second look, I see
the data set I have here may not be random, since the classifier achieves about
100% accuracy! That dramatically affects how deep the trees are, and maybe
explains the difference. In any event, I'd say the likely speedup is indeed far
less than 50x, even if it's still significant (2x+?). It's more about patching
a case that becomes pathological.
> Random decision forest is excessively slow on numeric features
> --------------------------------------------------------------
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.7, 0.8, 0.9
> Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it
> take an unreasonably long time on about 2GB of data -- like, >24 hours when
> other RDF M/R implementations take 9 minutes. The difference is big enough to
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours.
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A
> split is tried for every distinct numeric value of the feature in the whole
> data set. Since these are floating point values, they could all be distinct
> (and in the customer's case, they are). 200K rows means 200K splits to
> evaluate every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate
> only those as splits. Doing that really efficiently would require a
> substantial rewrite. However, there are some modest changes possible which
> get some of the benefit, and appear to make it run about 3x faster. That is,
> on a data set that exhibits this problem, meaning one whose numeric feature
> values are mostly distinct, which is not exotic.
> There are comparable but different problems with handling of categorical
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating
> only a sample of splits instead of every single possible one. In particular
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit".
> I think the point, though, is that "optimized" may mean choosing different
> splits here, which could yield different trees, so that test probably has
> to go.
> (Along the way I found a number of micro-optimizations in this part of the
> code that added up to maybe a 3% speedup. I also fixed an NPE.)
> I will propose a patch shortly with all of this for thoughts.
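
For anyone following along, the percentile-sampling idea from the description can be sketched roughly as below. This is only an illustration, not the attached patch and not Mahout's actual OptIgSplit code. The class name QuantileSplits, its method names, and the numCandidates parameter are all made up for this example. It assumes class labels are integers in [0, numLabels) and that a split sends values below the threshold to one side and the rest to the other.

```java
import java.util.Arrays;

/** Hypothetical sketch: evaluate only quantile-sampled thresholds as splits. */
final class QuantileSplits {

  private QuantileSplits() {}

  /** Returns at most numCandidates distinct thresholds at evenly spaced quantiles. */
  static double[] candidateThresholds(double[] values, int numCandidates) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    if (sorted.length <= numCandidates) {
      // Few enough values: every distinct value can be a candidate.
      return Arrays.stream(sorted).distinct().toArray();
    }
    double[] candidates = new double[numCandidates];
    for (int i = 0; i < numCandidates; i++) {
      // Position of the (i+1)/(numCandidates+1) quantile in the sorted values.
      int index = (int) ((long) (i + 1) * sorted.length / (numCandidates + 1));
      candidates[i] = sorted[index];
    }
    return Arrays.stream(candidates).distinct().toArray();
  }

  /** Shannon entropy, in bits, of a vector of class counts. */
  static double entropy(int[] counts) {
    int total = 0;
    for (int c : counts) {
      total += c;
    }
    if (total == 0) {
      return 0.0;
    }
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }

  /** Information gain of splitting into value < threshold and value >= threshold. */
  static double informationGain(double[] values, int[] labels, int numLabels, double threshold) {
    int[] all = new int[numLabels];
    int[] low = new int[numLabels];
    int[] high = new int[numLabels];
    int lowTotal = 0;
    for (int i = 0; i < values.length; i++) {
      all[labels[i]]++;
      if (values[i] < threshold) {
        low[labels[i]]++;
        lowTotal++;
      } else {
        high[labels[i]]++;
      }
    }
    double n = values.length;
    return entropy(all) - (lowTotal / n) * entropy(low) - ((n - lowTotal) / n) * entropy(high);
  }

  /** Best threshold among the sampled candidates only; NaN if there are none. */
  static double bestThreshold(double[] values, int[] labels, int numLabels, int numCandidates) {
    double best = Double.NaN;
    double bestGain = Double.NEGATIVE_INFINITY;
    for (double t : candidateThresholds(values, numCandidates)) {
      double gain = informationGain(values, labels, numLabels, t);
      if (gain > bestGain) {
        bestGain = gain;
        best = t;
      }
    }
    return best;
  }
}
```

With 200K distinct values and, say, numCandidates = 32, each node evaluates at most 32 thresholds instead of 200K. The trade-off is the one noted in the description: the chosen split may differ from what an exhaustive search over every distinct value would pick, so the resulting trees can differ.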