GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2180#issuecomment-53746339
@chouqin Thanks for observing that we can sometimes avoid calculating the
prediction and/or the info gain. I'm worried that this won't really change the
scaling of the algorithm much since calculating the prediction is a low-cost
operation. (This computation is done on the master node, and for any
reasonably sized dataset, the time spent on the master node is negligible
compared to the time spent in the treeAggregate() call.)
I'm also worried about this PR clashing with the current DecisionTree PR,
https://github.com/apache/spark/pull/2125, which moves the calculation of
predictions into separate Impurity* classes. Would it be possible to update
this once https://github.com/apache/spark/pull/2125 has gone through?
At that time, I think this PR could be simplified a bit by removing the
Predict class. InformationGainStats.predict already holds the prediction, and
InformationGainStats.gain can be computed or ignored as needed.
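
To illustrate the "computed or ignored as needed" idea, here is a minimal sketch. It assumes a hypothetical stand-in class (GainStatsSketch, not the real InformationGainStats) that always carries the prediction and derives the gain lazily. The constructor parameters other than predict are assumptions made for illustration, and the gain formula is the usual weighted impurity decrease, not necessarily what the PR uses.

```scala
// Hypothetical stand-in for InformationGainStats. Only `predict` and `gain`
// are names taken from the discussion; everything else is assumed.
final class GainStatsSketch(
    val predict: Double,
    impurity: Double,
    leftImpurity: Double,
    rightImpurity: Double,
    leftWeight: Double,
    rightWeight: Double) {

  // Evaluated on first access only, so callers that need just the
  // prediction (e.g. for a leaf node) never pay for the gain computation.
  lazy val gain: Double = {
    val total = leftWeight + rightWeight
    impurity -
      (leftWeight / total) * leftImpurity -
      (rightWeight / total) * rightImpurity
  }
}

object GainStatsSketchDemo {
  def main(args: Array[String]): Unit = {
    val stats = new GainStatsSketch(
      predict = 1.0,
      impurity = 0.5,
      leftImpurity = 0.0,
      rightImpurity = 0.5,
      leftWeight = 2.0,
      rightWeight = 2.0)

    println(s"prediction = ${stats.predict}") // gain has not been computed yet
    println(s"gain = ${stats.gain}")          // computed here, on first use
  }
}
```

With something along these lines, a separate Predict class may not be needed: the prediction is always available, and the gain costs nothing unless it is read.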