[GitHub] spark pull request: [SPARK-2207][SPARK-3272]Add minimum informatio...

chouqin Tue, 09 Sep 2014 16:52:23 -0700

Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2332#issuecomment-55053558
  
    @jkbradley thanks for your comments, I will change my code accordingly. As 
for the Predict class, I still think it is needed, for the following reasons:
    
    1. Saving of computation, for each split, it will traverse a array of bins 
two times(one to add, one to find the index that has the maximum value), I 
don't think this saving is trival. 
    
    2. As for code simplicity, I think predict value for a node should not tied 
to information gain for a split(information gain ). I have read Weka and 
scikit-learn's decision tree code, they don't store a predict value along with 
split's information gain stats. I think the changed code may be easy to 
understand somehow.
    
    3. For the early return of calculate `calculateGainForSplit`, when left 
child or right child has less than `minInstancesPerNode` we can just return an 
invalid information gain stats, without calculate the predict value.If all 
splits are early returned, we need a way to calculate the predict value for 
current node.
    
    If you don't like creating a new `Predict` class, I can use a tuple to 
replace that, but this seems to be harder to understand.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-2207][SPARK-3272]Add minimum informatio...

Reply via email to