137alpha commented on pull request #32813: URL: https://github.com/apache/spark/pull/32813#issuecomment-856441296
Hello, I am the author of the Jira ticket https://issues.apache.org/jira/browse/SPARK-34591. In my view, the behaviour described in the ticket is a serious problem: it makes the DecisionTreeClassifier and the RandomForestClassifier unreliable for probability estimation in Spark 2.4.0 and all later versions. Additionally, the original implementation of the feature did not update the Spark ML documentation to describe this non-standard modification to the tree algorithm. Because the behaviour conflicted with the Spark documentation, the only way I could trace it was to examine every Jira ticket referenced in the release notes after Spark 2.3.0 (where I knew this problem did not exist) and identify the ones that might be responsible. In my own experience, three of my clients have been directly affected by this issue. The Jira ticket gives a minimal example with "maximally worst" behaviour: a tree that is pruned (outside the user's control) so that there are no splits at all.
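To illustrate why pruning of this kind matters for probability estimation, here is a toy sketch in plain Python. It is not Spark code and not the example from the Jira ticket. It assumes the pruning merges sibling leaves whenever they predict the same class, which is my reading of the behaviour discussed here; the `Node` class, the `prune` and `predict_proba` helpers, and the class counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Binary-classification tree node holding (negative, positive) counts."""
    counts: Tuple[int, int]
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    threshold: Optional[float] = None  # split point; None for a leaf

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    def prediction(self) -> int:
        # Majority class: 1 only if positives strictly outnumber negatives.
        return int(self.counts[1] > self.counts[0])

    def probability(self) -> float:
        # Estimated probability of the positive class at this node.
        return self.counts[1] / sum(self.counts)


def prune(node: Node) -> Node:
    """Assumed pruning rule: collapse a split whose two leaves predict the same class."""
    if node.is_leaf:
        return node
    left, right = prune(node.left), prune(node.right)
    if left.is_leaf and right.is_leaf and left.prediction() == right.prediction():
        return Node(counts=node.counts)  # the split is discarded
    return Node(node.counts, left, right, node.threshold)


def predict_proba(node: Node, x: float) -> float:
    """Walk a single-feature tree and return the positive-class probability."""
    while not node.is_leaf:
        node = node.left if x <= node.threshold else node.right
    return node.probability()


# Hypothetical counts: both leaves predict class 0, but with different probabilities.
tree = Node((75, 25), Node((45, 5)), Node((30, 20)), threshold=0.5)
pruned = prune(tree)

print(predict_proba(tree, 0.0), predict_proba(tree, 1.0))
print(predict_proba(pruned, 0.0), predict_proba(pruned, 1.0))
```

In this sketch the unpruned tree separates a 0.1 region from a 0.4 region, while the pruned tree has no splits and returns 0.25 for every input. The predicted class is unchanged, which is why the effect is easy to miss unless the probabilities themselves are being used.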
