GitHub user erikerlandson commented on the issue:

    https://github.com/apache/spark/pull/13440
  
    @srowen I discuss some of these questions in the [blog post](http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/),
but the tl;dr is that split quality measures based on statistical tests with
p-values are in some sense "less arbitrary."  A p-value used as a split
quality halting condition has essentially the same semantics regardless of the
test.  Most such tests also intrinsically take decreasing population sizes
into account: as the splitting progresses and population sizes decrease, it
takes a larger and larger population difference to meet the p-value threshold.
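
    As a rough illustration of the population-size point, here is a minimal Python sketch (not the code in this PR). It assumes a binary label and a two-way split, so the chi-squared test has one degree of freedom and its p-value can be computed with `math.erfc`. The function names `chi2_p_value_2x2` and `split_p_value`, the 60%/40% class proportions, and the sample sizes are all made up for the example.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """p-value of Pearson's chi-squared test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def split_p_value(n_per_child):
    """Hypothetical split: left child is 60% positive, right child is 40% positive."""
    left_pos = (6 * n_per_child) // 10
    right_pos = (4 * n_per_child) // 10
    return chi2_p_value_2x2(left_pos, n_per_child - left_pos,
                            right_pos, n_per_child - right_pos)

# Same class proportions at every size; only the population shrinks.
for n in (1000, 100, 20):
    print(n, split_p_value(n))
```

    With a threshold of, say, 0.01, the same 60/40 difference passes at 100 rows per child but not at 20, so a deeper node needs a larger difference to justify a split.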
    
    On the more pragmatic side, in that post I also demonstrate that
chi-squared split quality generates a more parsimonious tree than the other
metrics and does a better job of ignoring poor-quality features.

