Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/5330#issuecomment-92189854
That's it, but I ran some timing tests locally and, as you reported, found
essentially no difference between the two implementations. I think the
issue is the overhead of broadcast variables. I tried broadcasting the full
arrays for evaluateEachIteration(), rather than each element separately, and it
made evaluateEachIteration() take about 2/3 of the original time. This was
with depth-2 trees and 100 iterations of regression on a tiny test dataset
("abalone" from [libsvm's copy of the UCI
dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)).
Based on this bit of testing, I would guess the best solution will be to
handle learning and evaluateEachIteration separately:
* Learning: Do not broadcast trees or weights (but do use the caching you
implemented here).
  * Communicating 1 tree should be a tiny cost compared to the cost of
learning the tree.
* evaluateEachIteration: Broadcast the full tree array. Compute errors for all
iterations in a single map() call, and aggregate arrays of errors rather than
individual errors.
  * Don't broadcast the weights array, since it is small.
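To make the evaluateEachIteration idea concrete, here is a minimal sketch of the single-pass, array-valued aggregation in plain Python. The "trees" are stand-in callables rather than real DecisionTreeModels, and the loop over data stands in for Spark's map()/reduce() over a broadcast tree array; function names are hypothetical.

```python
def errors_per_iteration(trees, weights, x, label):
    """Squared error after each boosting iteration for one point.

    Accumulates the weighted prediction tree by tree, so all
    iterations' errors come from one pass over the trees.
    """
    pred = 0.0
    errors = []
    for tree, w in zip(trees, weights):
        pred += w * tree(x)
        errors.append((label - pred) ** 2)
    return errors

def evaluate_each_iteration(data, trees, weights):
    """One pass over the data (the single map() call), summing
    whole error arrays element-wise (the array-valued aggregate),
    instead of one pass and one scalar aggregate per iteration.
    """
    n = len(data)
    sums = [0.0] * len(trees)
    for x, y in data:  # in Spark: data.map(...).reduce(elementwise add)
        for m, e in enumerate(errors_per_iteration(trees, weights, x, y)):
            sums[m] += e
    return [s / n for s in sums]  # MSE after each iteration

trees = [lambda x: x, lambda x: 0.5 * x]   # stand-in "trees"
weights = [1.0, 1.0]
data = [(1.0, 2.0), (2.0, 3.0)]            # (feature, label) pairs
print(evaluate_each_iteration(data, trees, weights))  # [1.0, 0.125]
```

The point is that the tree array crosses the network once (one broadcast) and the data is scanned once, rather than paying broadcast-plus-scan overhead per iteration.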
I'm OK with merging this PR for now and leaving those items as future to-dos.
But if you'd prefer to make these updates in this PR, that works too.
Do you agree with this assessment? What path would you prefer?