Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5330#issuecomment-92189854
  
    That's it, but I ran some timing tests locally and found essentially no 
difference between the two implementations, matching what you reported.  I think 
the issue is the overhead of broadcast variables.  I tried broadcasting the full 
arrays for evaluateEachIteration(), rather than each element separately, and it 
cut evaluateEachIteration() to about 2/3 of the original time.  This was with 
depth-2 trees and 100 iterations of regression on a tiny test dataset 
("abalone" from [libsvm's copy of the UCI 
dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)).
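    To illustrate why broadcasting the full array helps, here is a toy plain-Python sketch (not the Spark API): `CountingBroadcaster` is a hypothetical stand-in for `SparkContext.broadcast` that just counts calls, showing that per-element broadcasting pays the per-broadcast overhead once per boosting iteration, while a single full-array broadcast pays it once.

    ```python
    class CountingBroadcaster:
        """Toy stand-in for SparkContext.broadcast; counts how many broadcasts happen."""
        def __init__(self):
            self.calls = 0

        def broadcast(self, value):
            self.calls += 1   # each call models one serialize-and-ship round trip
            return value      # a real broadcast returns a handle; identity is enough here

    trees = [f"tree_{i}" for i in range(100)]   # 100 boosting iterations

    per_element = CountingBroadcaster()
    for t in trees:                  # broadcasting each tree separately
        per_element.broadcast(t)

    full_array = CountingBroadcaster()
    full_array.broadcast(trees)      # broadcasting the whole ensemble once

    # per_element.calls == 100, full_array.calls == 1
    ```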
    
    Based on this bit of testing, I would guess the best solution will be to 
handle learning and evaluateEachIteration separately:
    * Learning: Do not broadcast trees or weights (but do use the caching you 
implemented here).
      * Communicating 1 tree should be a tiny cost compared to the cost of 
learning the tree.
    * evaluateEachIteration: Broadcast full tree array.  Compute errors for all 
iterations in a single map() call, and aggregate arrays of errors rather than 
individual errors.
      * Don't broadcast the weights array since it is small.
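    The evaluateEachIteration strategy above can be sketched in plain Python (standing in for the Spark version, so `tree(features)` is a hypothetical stand-in for `DecisionTreeModel.predict` and the outer loop over `data` plays the role of the single map()/aggregate pass): each data point produces an array of per-iteration squared errors from the running prediction, and those arrays are aggregated element-wise.

    ```python
    def evaluate_each_iteration(data, trees, weights):
        """Mean squared error after each boosting iteration, in one pass over the data."""
        num_iters = len(trees)
        sums = [0.0] * num_iters            # element-wise aggregation of error arrays
        count = 0
        for features, label in data:        # the single map()-like pass over the data
            pred = 0.0
            errors = []
            for tree, w in zip(trees, weights):
                pred += w * tree(features)  # incrementally add each tree's contribution
                errors.append((pred - label) ** 2)
            for i, e in enumerate(errors):  # aggregate the whole array, not one error
                sums[i] += e
            count += 1
        return [s / count for s in sums]    # MSE after each iteration

    # Toy ensemble: each "tree" predicts a constant correction.
    trees = [lambda x: 1.0, lambda x: 0.5]
    weights = [1.0, 1.0]
    data = [([0.0], 1.5), ([0.0], 1.5)]
    mses = evaluate_each_iteration(data, trees, weights)
    # after tree 1: pred = 1.0, error = 0.25; after tree 2: pred = 1.5, error = 0.0
    ```

    The point of returning whole arrays is that the per-point work for all iterations shares one scan of the dataset, rather than one scan (and one broadcast) per iteration.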
    
    I'm OK with merging this PR for now and making those items a future to-do.  
But if you'd prefer to make these updates to this PR, that works too.
    
    Do you agree with this assessment?  What path would you prefer?

