Github user etrain commented on the pull request:

    https://github.com/apache/spark/pull/79#issuecomment-36817253
  
    This PR is the result of several iterations on building fast decision trees for Spark.
    
    To offer a little more color on the design here, the key ideas are:
    
    1. Histograms to compute best-split statistics. We quantize input attributes and compute best splits via relatively cheap operations on histograms. This is a precision/speed tradeoff: we give up the ability to find the "perfect" split points that an exact search (as in C4.5) would yield, but we still find very close ones (see the first sketch after this list).
    2. Minimize total passes over the data by training models "level-wise". At each level of the tree we evaluate which node a point belongs to and let the point contribute to the statistics (histogram bins) for that node only (see the second sketch below). While the number of histograms is therefore exponential in tree depth, these data structures are relatively small, and we expect very deep trees to be relatively uncommon (especially in the ensemble case). This also means we need only cache the working set once, instead of partitioning the data and caching the partitions.
    3. Support arbitrary split statistics (e.g. Gini, entropy, variance) and both categorical and continuous input and response values - I believe the current design allows for this efficiently; the impurity function in the first sketch below is the piece that would be pluggable.
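    
    To make points 1 and 3 a bit more concrete, here is a minimal, self-contained sketch in plain Scala - not the code in this PR - of choosing a split threshold for one continuous feature using only histogram counts. The equal-width binning, the toy data, and the names are illustrative simplifications; the real implementation uses quantile-based bins and its own abstractions.
    
    ```scala
    // Hypothetical sketch, not the code in this PR: pick the best threshold for
    // one continuous feature by quantizing values into bins and scanning bin
    // boundaries, so the split search touches only histogram counts.
    object HistogramSplitSketch {

      // Gini impurity from per-class counts; entropy or variance could be
      // swapped in here without changing the rest of the search (point 3).
      def gini(counts: Array[Double]): Double = {
        val total = counts.sum
        if (total == 0.0) 0.0
        else 1.0 - counts.map(c => (c / total) * (c / total)).sum
      }

      def main(args: Array[String]): Unit = {
        val numBins = 8
        val numClasses = 2
        // Toy labeled points: (featureValue, classLabel).
        val data = Seq(
          (0.1, 0), (0.4, 0), (0.9, 0), (1.3, 0),
          (2.1, 1), (2.6, 1), (3.4, 1), (3.9, 1))

        // 1. Quantize the feature (equal-width bins here, purely to keep the
        //    sketch short; quantile-based bins are the more realistic choice).
        val (lo, hi) = (data.map(_._1).min, data.map(_._1).max)
        val width = (hi - lo) / numBins
        def binOf(x: Double): Int = math.min(((x - lo) / width).toInt, numBins - 1)

        // 2. One pass over the data fills a per-bin, per-class histogram.
        val hist = Array.fill(numBins, numClasses)(0.0)
        data.foreach { case (x, label) => hist(binOf(x))(label) += 1.0 }

        // 3. Every bin boundary is a candidate split, evaluated from counts
        //    alone - no further passes over the raw data are needed.
        val totals = Array.tabulate(numClasses)(c => hist.map(_(c)).sum)
        val n = totals.sum
        val best = (1 until numBins).map { b =>
          val left = Array.tabulate(numClasses)(c => hist.take(b).map(_(c)).sum)
          val right = Array.tabulate(numClasses)(c => totals(c) - left(c))
          val weighted = (left.sum / n) * gini(left) + (right.sum / n) * gini(right)
          (weighted, lo + b * width)
        }.minBy(_._1)

        println(s"best threshold ~ ${best._2}, weighted Gini = ${best._1}")
      }
    }
    ```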
    
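    And a similarly simplified sketch of the level-wise idea in point 2; `nodeIndexAt` and the toy data are hypothetical stand-ins, and the real code performs this sweep as a distributed aggregation over the cached RDD rather than a local loop:
    
    ```scala
    // Hypothetical sketch of the level-wise pass, again plain Scala rather than
    // the PR's code: a single sweep over the points fills the histograms for
    // every node at the current depth.
    object LevelWiseSketch {

      // Stand-in for replaying the splits chosen at earlier levels: route a
      // point to a node index at the given depth by thresholding one feature
      // per level.  The real tree replays its learned split decisions instead.
      def nodeIndexAt(features: Array[Double], level: Int): Int =
        (0 until level).foldLeft(0) { (idx, d) =>
          2 * idx + (if (features(d % features.length) > 0.5) 1 else 0)
        }

      def main(args: Array[String]): Unit = {
        val numClasses = 2
        val numBins = 4
        val level = 2                          // 2^level = 4 nodes at this depth
        val numNodes = 1 << level

        // Toy data: (features in [0, 1), classLabel).
        val rng = new scala.util.Random(42)
        val data = Seq.fill(1000)((Array.fill(3)(rng.nextDouble()), rng.nextInt(numClasses)))

        // One histogram per (node, bin, class).  This per-level working set
        // grows with 2^depth but is independent of the number of points.
        val hist = Array.fill(numNodes, numBins, numClasses)(0.0)

        // Single pass: each point contributes only to the node it currently
        // falls into, so one sweep covers the whole level.
        data.foreach { case (features, label) =>
          val node = nodeIndexAt(features, level)
          val bin = math.min((features(0) * numBins).toInt, numBins - 1)
          hist(node)(bin)(label) += 1.0
        }

        hist.zipWithIndex.foreach { case (h, node) =>
          println(s"node $node per-bin counts: ${h.map(_.sum).mkString(" ")}")
        }
      }
    }
    ```
    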
    I tested the code in this PR for scalability on clusters ranging from 2 slaves (m2.xlarge nodes) up to 16 slaves of the same type; the master was also an m2.xlarge. Test data was generated with the LinearDataGenerator class, ranging from 10m to 50m points and from d=10 to d=50 dimensions, which equates to roughly 700MB up to 18GB of training data, and all models were trained to a maximum depth of 10. Total runtime on 16 machines ranged from under 5 minutes for the 10m*10 experiment to 36 minutes for the 50m*50 experiment.
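    
    For reference, this is roughly how one of these datasets can be generated; the generateLinearRDD signature shown is from memory, so check it against the class before reusing it (sized small here so it runs locally, whereas the experiments used 10m-50m points):
    
    ```scala
    // Hedged sketch of generating benchmark data.  Assumes MLlib's
    // LinearDataGenerator exposes generateLinearRDD(sc, numExamples, numFeatures,
    // eps, numPartitions) with roughly that positional signature.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.LinearDataGenerator

    object GenerateTreeBenchData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("tree-bench-data").setMaster("local[*]"))

        // 1m points with d = 10 features and a little label noise.
        val data = LinearDataGenerator.generateLinearRDD(sc, 1000000, 10, 0.1, 8)

        data.cache()   // the working set is cached once, per point 2 above
        println(s"generated ${data.count()} labeled points")
        sc.stop()
      }
    }
    ```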
    
    Scalability results are shown here:
    ![screen shot 2014-03-05 at 5 49 07 pm](https://f.cloud.github.com/assets/1326181/2341114/92d71ce6-a4d1-11e3-8efd-82d57dc68c27.png)
    
    
    On the x-axis is the number of machines (2..16) and on the y-axis is speedup. Each box in the grid represents a different dataset. Ideal scaling would be 8x.
    
    There are several things worth pointing out on this chart:
    
    1. There are some cases where the 2-machine experiments didn't finish (the last 2 30m experiments and all of the 40m and 50m experiments); in those cases the baseline is the 4-machine run, so ideal speedup is 16/4 = 4x.
    2. There are also a couple of cases where we appear to do better than ideal, but those are cases where GC overhead was substantial in the 2-machine runs, so I don't consider them apples-to-apples.
    3. Overall, with this parallelization strategy, the observations I'm calling "fair" show an average speedup of 6.6x for 16 nodes vs. 2 - about 17% below the ideal 8x - with a standard deviation of 0.7 in speedup across those experiments.

