Github user sethah commented on the pull request:
https://github.com/apache/spark/pull/9008#issuecomment-187524117
I noticed a problem with the current implementation regarding the
`minInstancesPerNode` parameter. The number of _instances_ in each node is now
a weighted count, and the weights can have an arbitrary scale. For example,
training with uniform weights of 1.0 will build a different tree than training
with uniform weights of 1.0 / N (where N is the number of samples), even though
the data are identical (a small sketch of the issue is below). I suppose there
are a number of ways to mitigate this.
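Here is a minimal sketch of the scale dependence, assuming a validity check of
the hypothetical form `weightedCount >= minInstancesPerNode` (the names
`isValidNode`, `nodeSize`, etc. are illustrative, not the actual Spark code):

```scala
object WeightedCountScaleDemo {
  // Hypothetical form of the split-validity check discussed above:
  // a node is valid only if its weighted instance count reaches the threshold.
  def isValidNode(weightedCount: Double, minInstancesPerNode: Int): Boolean =
    weightedCount >= minInstancesPerNode

  def main(args: Array[String]): Unit = {
    val n = 1000              // total number of training samples
    val nodeSize = 10         // raw number of samples reaching the node
    val minInstancesPerNode = 5

    // Uniform weights of 1.0: the weighted count equals the raw count.
    val countUnitWeights = nodeSize * 1.0
    // Uniform weights of 1.0 / N: same data, but the weighted count shrinks.
    val countScaledWeights = nodeSize * (1.0 / n)

    println(isValidNode(countUnitWeights, minInstancesPerNode))   // true
    println(isValidNode(countScaledWeights, minInstancesPerNode)) // false
  }
}
```

So the same node passes or fails the check depending only on the scale of the
weights.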
I checked scikit-learn, and it tracks the raw (unweighted) sample counts as
well as the sample weights: `min_samples_leaf` enforces validity based on raw
counts, while `min_weight_fraction_leaf` enforces it based on weighted counts.
That is not possible under the current implementation here because we lose the
raw counts when we convert `unadjustedBaggedInput` to `baggedInput`. We could
either compare the weighted split counts against `minInstancesPerNode / N`,
where N is the number of training samples, or adjust the `BaggedPoint` class to
store both counts and weights and proceed à la scikit-learn (a rough sketch of
that second option is below). I'm not sure which is best, thoughts?
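A rough sketch of the second option, assuming we store per-tree raw subsample
counts alongside the instance weight. `BaggedPointSketch`, `LeafValidity`, and
the method names are hypothetical and only meant to mirror scikit-learn's
split between `min_samples_leaf` and `min_weight_fraction_leaf`, not the
actual Spark API:

```scala
// Rough sketch: a bagged point that keeps both the raw per-tree counts
// (unweighted) and the user-supplied instance weight.
case class BaggedPointSketch[Datum](
    datum: Datum,
    subsampleCounts: Array[Int],   // raw counts per tree (unweighted)
    sampleWeight: Double)          // user-supplied instance weight

object LeafValidity {
  // Raw-count check, analogous to minInstancesPerNode / min_samples_leaf.
  def hasEnoughSamples(points: Seq[BaggedPointSketch[_]], treeIdx: Int,
                       minInstancesPerNode: Int): Boolean =
    points.map(_.subsampleCounts(treeIdx)).sum >= minInstancesPerNode

  // Weighted check, analogous to min_weight_fraction_leaf: compare the node's
  // share of the total weight against a fraction, so the absolute scale of
  // the weights cancels out.
  def hasEnoughWeight(points: Seq[BaggedPointSketch[_]], treeIdx: Int,
                      totalWeight: Double, minWeightFraction: Double): Boolean =
    points.map(p => p.subsampleCounts(treeIdx) * p.sampleWeight).sum >=
      minWeightFraction * totalWeight
}
```

With something like this, both checks could be applied independently, and the
weighted check stays invariant to rescaling all weights by a constant.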