Github user etrain commented on the pull request:

    https://github.com/apache/spark/pull/79#issuecomment-36817253
  
    This PR is the result of several iterations on building fast decision trees for Spark.
    
    To offer a little more color on the design here, the key ideas are:
    
    1. Histograms to compute best-split statistics. We quantize input attributes and compute best splits via relatively cheap operations on histograms. This is a precision/speed tradeoff: we give up the ability to find the "perfect" split points that an exact search (as in C4.5) would yield, but we still find very close ones (see the first sketch after this list).
    2. Minimize total passes over the data by training models "level-wise". At each level of the tree we evaluate which node a point belongs to and let the point contribute to the statistics (histogram bins) for that node only (see the second sketch below). While the number of histograms is therefore exponential in tree depth, these data structures are relatively small, and we expect very deep trees to be relatively uncommon (especially in the ensemble case). This also means we need only cache the working set once, instead of partitioning the data and caching the partitions.
    3. Support arbitrary split statistics (e.g. Gini, entropy, variance) and both categorical and continuous input and response values - I believe the current design allows for this efficiently; the impurity function in the first sketch below is the piece that would be pluggable.
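    
    To make points 1 and 3 a bit more concrete, here is a minimal, self-contained sketch in plain Scala - not the code in this PR - of choosing a split threshold for one continuous feature using only histogram counts. The equal-width binning, the toy data, and the names are illustrative simplifications; the real implementation uses quantile-based bins and its own abstractions.
    
    ```scala
    // Hypothetical sketch, not the code in this PR: pick the best threshold for
    // one continuous feature by quantizing values into bins and scanning bin
    // boundaries, so the split search touches only histogram counts.
    object HistogramSplitSketch {

      // Gini impurity from per-class counts; entropy or variance could be
      // swapped in here without changing the rest of the search (point 3).
      def gini(counts: Array[Double]): Double = {
        val total = counts.sum
        if (total == 0.0) 0.0
        else 1.0 - counts.map(c => (c / total) * (c / total)).sum
      }

      def main(args: Array[String]): Unit = {
        val numBins = 8
        val numClasses = 2
        // Toy labeled points: (featureValue, classLabel).
        val data = Seq(
          (0.1, 0), (0.4, 0), (0.9, 0), (1.3, 0),
          (2.1, 1), (2.6, 1), (3.4, 1), (3.9, 1))

        // 1. Quantize the feature (equal-width bins here, purely to keep the
        //    sketch short; quantile-based bins are the more realistic choice).
        val (lo, hi) = (data.map(_._1).min, data.map(_._1).max)
        val width = (hi - lo) / numBins
        def binOf(x: Double): Int = math.min(((x - lo) / width).toInt, numBins - 1)

        // 2. One pass over the data fills a per-bin, per-class histogram.
        val hist = Array.fill(numBins, numClasses)(0.0)
        data.foreach { case (x, label) => hist(binOf(x))(label) += 1.0 }

        // 3. Every bin boundary is a candidate split, evaluated from counts
        //    alone - no further passes over the raw data are needed.
        val totals = Array.tabulate(numClasses)(c => hist.map(_(c)).sum)
        val n = totals.sum
        val best = (1 until numBins).map { b =>
          val left = Array.tabulate(numClasses)(c => hist.take(b).map(_(c)).sum)
          val right = Array.tabulate(numClasses)(c => totals(c) - left(c))
          val weighted = (left.sum / n) * gini(left) + (right.sum / n) * gini(right)
          (weighted, lo + b * width)
        }.minBy(_._1)

        println(s"best threshold ~ ${best._2}, weighted Gini = ${best._1}")
      }
    }
    ```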
    
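    And a similarly simplified sketch of the level-wise idea in point 2; `nodeIndexAt` and the toy data are hypothetical stand-ins, and the real code performs this sweep as a distributed aggregation over the cached RDD rather than a local loop:
    
    ```scala
    // Hypothetical sketch of the level-wise pass, again plain Scala rather than
    // the PR's code: a single sweep over the points fills the histograms for
    // every node at the current depth.
    object LevelWiseSketch {

      // Stand-in for replaying the splits chosen at earlier levels: route a
      // point to a node index at the given depth by thresholding one feature
      // per level.  The real tree replays its learned split decisions instead.
      def nodeIndexAt(features: Array[Double], level: Int): Int =
        (0 until level).foldLeft(0) { (idx, d) =>
          2 * idx + (if (features(d % features.length) > 0.5) 1 else 0)
        }

      def main(args: Array[String]): Unit = {
        val numClasses = 2
        val numBins = 4
        val level = 2                          // 2^level = 4 nodes at this depth
        val numNodes = 1 << level

        // Toy data: (features in [0, 1), classLabel).
        val rng = new scala.util.Random(42)
        val data = Seq.fill(1000)((Array.fill(3)(rng.nextDouble()), rng.nextInt(numClasses)))

        // One histogram per (node, bin, class).  This per-level working set
        // grows with 2^depth but is independent of the number of points.
        val hist = Array.fill(numNodes, numBins, numClasses)(0.0)

        // Single pass: each point contributes only to the node it currently
        // falls into, so one sweep covers the whole level.
        data.foreach { case (features, label) =>
          val node = nodeIndexAt(features, level)
          val bin = math.min((features(0) * numBins).toInt, numBins - 1)
          hist(node)(bin)(label) += 1.0
        }

        hist.zipWithIndex.foreach { case (h, node) =>
          println(s"node $node per-bin counts: ${h.map(_.sum).mkString(" ")}")
        }
      }
    }
    ```
    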
    I tested the code in this PR for scalability on clusters ranging from 2 slaves (m2.xlarge nodes) up to 16 slaves of the same type; the master was also an m2.xlarge. Test data was generated with the LinearDataGenerator class, ranging from 10m to 50m points and from d=10 to d=50 dimensions, which equates to roughly 700MB up to 18GB of training data, and all models were trained to a maximum depth of 10. Total runtime on 16 machines ranged from under 5 minutes for the 10m*10 experiment to 36 minutes for the 50m*50 experiment.
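    
    For reference, this is roughly how one of these datasets can be generated; the generateLinearRDD signature shown is from memory, so check it against the class before reusing it (sized small here so it runs locally, whereas the experiments used 10m-50m points):
    
    ```scala
    // Hedged sketch of generating benchmark data.  Assumes MLlib's
    // LinearDataGenerator exposes generateLinearRDD(sc, numExamples, numFeatures,
    // eps, numPartitions) with roughly that positional signature.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.LinearDataGenerator

    object GenerateTreeBenchData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("tree-bench-data").setMaster("local[*]"))

        // 1m points with d = 10 features and a little label noise.
        val data = LinearDataGenerator.generateLinearRDD(sc, 1000000, 10, 0.1, 8)

        data.cache()   // the working set is cached once, per point 2 above
        println(s"generated ${data.count()} labeled points")
        sc.stop()
      }
    }
    ```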
    
    Scalability results are shown here:
    ![screen shot 2014-03-05 at 5 49 07 pm](https://f.cloud.github.com/assets/1326181/2341114/92d71ce6-a4d1-11e3-8efd-82d57dc68c27.png)
    
    
    On the x-axis is the number of machines (2..16) and on the y-axis is speedup. Each box in the grid represents a different dataset. Ideal scaling would be 8x.
    
    There are several things worth pointing out on this chart:
    
    1. There are some cases where the 2-machine experiments didn't finish (the last 2 30m experiments and all of the 40m and 50m experiments); in those cases the baseline is the 4-machine run, so ideal speedup is 16/4 = 4x.
    2. There are also a couple of cases where we appear to do better than ideal, but those are cases where GC overhead was substantial in the 2-machine runs, so I don't consider them apples-to-apples.
    3. Overall, with this parallelization strategy, the observations I'm calling "fair" show an average speedup of 6.6x for 16 nodes vs. 2 - about 17% below the ideal 8x - with a standard deviation of 0.7 in speedup across those experiments.

