Github user chouqin commented on the pull request:
https://github.com/apache/spark/pull/2332#issuecomment-55053558
@jkbradley thanks for your comments, I will change my code accordingly. As
for the Predict class, I still think it is needed, for the following reasons:
1. Saving of computation, for each split, it will traverse a array of bins
two times(one to add, one to find the index that has the maximum value), I
don't think this saving is trival.
2. As for code simplicity, I think predict value for a node should not tied
to information gain for a split(information gain ). I have read Weka and
scikit-learn's decision tree code, they don't store a predict value along with
split's information gain stats. I think the changed code may be easy to
understand somehow.
3. For the early return of calculate `calculateGainForSplit`, when left
child or right child has less than `minInstancesPerNode` we can just return an
invalid information gain stats, without calculate the predict value.If all
splits are early returned, we need a way to calculate the predict value for
current node.
If you don't like creating a new `Predict` class, I can use a tuple to
replace that, but this seems to be harder to understand.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]