Github user sethah commented on the pull request:
https://github.com/apache/spark/pull/9008#issuecomment-187524117
I noticed a problem with the current implementation regarding the
`minInstancesPerNode` parameter. The number of _instances_ in each node is now
a weighted count, and the weights can have an arbitrary scale. For example,
training with uniform weights of 1.0 will build a different tree than training
with uniform weights of 1.0 / N (where N is the number of samples), even though
the data are identical (a small sketch of the issue is below). I suppose there
are a number of ways to mitigate this.
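Here is a minimal sketch of the scale dependence, assuming a validity check of
the hypothetical form `weightedCount >= minInstancesPerNode` (the names
`isValidNode`, `nodeSize`, etc. are illustrative, not the actual Spark code):

```scala
object WeightedCountScaleDemo {
  // Hypothetical form of the split-validity check discussed above:
  // a node is valid only if its weighted instance count reaches the threshold.
  def isValidNode(weightedCount: Double, minInstancesPerNode: Int): Boolean =
    weightedCount >= minInstancesPerNode

  def main(args: Array[String]): Unit = {
    val n = 1000              // total number of training samples
    val nodeSize = 10         // raw number of samples reaching the node
    val minInstancesPerNode = 5

    // Uniform weights of 1.0: the weighted count equals the raw count.
    val countUnitWeights = nodeSize * 1.0
    // Uniform weights of 1.0 / N: same data, but the weighted count shrinks.
    val countScaledWeights = nodeSize * (1.0 / n)

    println(isValidNode(countUnitWeights, minInstancesPerNode))   // true
    println(isValidNode(countScaledWeights, minInstancesPerNode)) // false
  }
}
```

So the same node passes or fails the check depending only on the scale of the
weights.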
I checked scikit-learn, and it tracks the raw (unweighted) sample counts as
well as the sample weights: `min_samples_leaf` enforces validity based on raw
counts, while `min_weight_fraction_leaf` enforces it based on weighted counts.
That is not possible under the current implementation here because we lose the
raw counts when we convert `unadjustedBaggedInput` to `baggedInput`. We could
either compare the weighted split counts against `minInstancesPerNode / N`,
where N is the number of training samples, or adjust the `BaggedPoint` class to
store both counts and weights and proceed à la scikit-learn (a rough sketch of
that second option is below). I'm not sure which is best, thoughts?
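A rough sketch of the second option, assuming we store per-tree raw subsample
counts alongside the instance weight. `BaggedPointSketch`, `LeafValidity`, and
the method names are hypothetical and only meant to mirror scikit-learn's
split between `min_samples_leaf` and `min_weight_fraction_leaf`, not the
actual Spark API:

```scala
// Rough sketch: a bagged point that keeps both the raw per-tree counts
// (unweighted) and the user-supplied instance weight.
case class BaggedPointSketch[Datum](
    datum: Datum,
    subsampleCounts: Array[Int],   // raw counts per tree (unweighted)
    sampleWeight: Double)          // user-supplied instance weight

object LeafValidity {
  // Raw-count check, analogous to minInstancesPerNode / min_samples_leaf.
  def hasEnoughSamples(points: Seq[BaggedPointSketch[_]], treeIdx: Int,
                       minInstancesPerNode: Int): Boolean =
    points.map(_.subsampleCounts(treeIdx)).sum >= minInstancesPerNode

  // Weighted check, analogous to min_weight_fraction_leaf: compare the node's
  // share of the total weight against a fraction, so the absolute scale of
  // the weights cancels out.
  def hasEnoughWeight(points: Seq[BaggedPointSketch[_]], treeIdx: Int,
                      totalWeight: Double, minWeightFraction: Double): Boolean =
    points.map(p => p.subsampleCounts(treeIdx) * p.sampleWeight).sum >=
      minWeightFraction * totalWeight
}
```

With something like this, both checks could be applied independently, and the
weighted check stays invariant to rescaling all weights by a constant.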