GitHub user erikerlandson commented on the issue:
https://github.com/apache/spark/pull/13440
@srowen I discuss some of these questions in the [blog
post](http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/),
but the tl;dr is that split quality measures based on statistical tests that
yield p-values are in some senses "less arbitrary." Specifying a p-value as a
split quality halting condition has essentially the same semantics regardless
of the test. Most such tests also intrinsically take decreasing population
sizes into account: as splitting progresses and population sizes decrease, it
takes a larger and larger population difference to meet the p-value threshold.
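
To make the population-size point concrete, here is a minimal sketch (not the code in this PR). It assumes a binary label, a binary split, and Pearson's chi-squared test without continuity correction; the helper name `chi2_p_value_2x2` and the 60/40 class mix are made up for illustration. With one degree of freedom the p-value has the closed form `erfc(sqrt(chi2 / 2))`, so only the standard library is needed.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """P-value of Pearson's chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are the two child nodes, columns are the two class labels.
    Hypothetical helper for illustration only.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a real difference
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Chi-squared with 1 degree of freedom: survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# The same 60/40 vs 40/60 class mix in the two children, at shrinking node sizes.
p_values = {}
for per_child in (1000, 100, 10):
    major = per_child * 6 // 10
    minor = per_child - major
    p_values[per_child] = chi2_p_value_2x2(major, minor, minor, major)

for per_child, p in p_values.items():
    print(f"{per_child:5d} rows per child -> p = {p:.3g}")
```

The class proportions are identical in all three cases, but the p-value grows as the nodes shrink, so a fixed threshold such as 0.01 accepts the split at 1000 and 100 rows per child and rejects it at 10.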
On the more pragmatic side, in that post I also demonstrate that chi-squared
split quality generates a more parsimonious tree than the other metrics and
does a better job of ignoring poor-quality features.
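
One way to see why a p-value criterion can prune weak features: a split on a small node can show a positive impurity gain and still be statistically unconvincing. The sketch below is a toy comparison, not the experiment from the blog post; the function names and the counts are invented for illustration, and it assumes two classes and a two-way split.

```python
import math

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(left, right):
    """Impurity decrease from splitting the parent into left and right."""
    parent = [l + r for l, r in zip(left, right)]
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)

def chi2_p_value(left, right):
    """Pearson chi-squared p-value (1 degree of freedom) for a 2x2 split table."""
    (a, b), (c, d) = left, right
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical weak split: 20 rows, children hold (6, 4) and (4, 6) class counts.
left, right = (6, 4), (4, 6)
gain = gini_gain(left, right)
p = chi2_p_value(left, right)
print(f"Gini gain = {gain:.3f}, chi-squared p-value = {p:.3f}")
```

The gain is small but positive, so an impurity criterion without a carefully tuned minimum-gain setting would accept this split, while the p-value is far above any conventional threshold and the split would be rejected.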