Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433

The failing SparkR test (which compares `RandomForest` predictions to hardcoded values) fails not because of a correctness issue but (AFAICT) because of an implementation change in best-split selection. In this PR we recompute parent node impurity stats when considering each split for a feature, instead of computing them once per feature (see this by comparing `RandomForest.calculateImpurityStats` in Spark master and `ImpurityUtils.calculateImpurityStats` in this PR).

Repeatedly recomputing parent impurity stats yields slightly different values at each iteration due to Double precision limitations. This in turn can cause different splits to be selected: if two splits have mathematically equal gains, Double precision limitations can make one split's computed gain slightly higher or lower than the other's, influencing tiebreaking.
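To make the tiebreaking point concrete, here is a minimal illustration (plain Python, not Spark code) of how evaluating a mathematically identical quantity in two different orders can produce unequal Doubles, so that a max-by-gain comparison no longer sees a tie:

```python
# Floating-point addition is not associative: the same mathematical sum,
# computed in two different orders, can yield two different Doubles.
gain_computed_once = (0.1 + 0.2) + 0.3    # analogous to computing parent stats once
gain_recomputed    = 0.1 + (0.2 + 0.3)    # analogous to recomputing them per split

print(gain_computed_once == gain_recomputed)  # False

# If two candidate splits have mathematically equal gains, the tiny
# discrepancy above decides which one max() selects -- i.e. it changes
# the tiebreaking outcome. (Split names here are purely illustrative.)
splits = [("splitA", gain_computed_once), ("splitB", gain_recomputed)]
best = max(splits, key=lambda s: s[1])
print(best[0])
```

This is why the hardcoded expected values in the SparkR test can diverge from the new implementation even though neither result is "wrong".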