137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-856441296


   Hello, I am the author of the Jira ticket 
https://issues.apache.org/jira/browse/SPARK-34591. 
   
   In my view, the behaviour described in the ticket is a serious problem: it 
makes the DecisionTreeClassifier and the RandomForestClassifier unreliable for 
probability estimation in Spark 2.4.0 and all later versions.
   
   Additionally, the change that introduced this feature did not update the 
Spark ML documentation to describe this non-standard modification to the tree 
algorithm. The only way I could trace the behaviour (given that it conflicts 
with the Spark documentation) was to examine every Jira ticket referenced in 
the release notes after Spark 2.3.0 (where I knew this problem did not exist) 
to identify those that might be responsible.
   
   In my own experience, three of my clients have been directly affected by 
this issue.
   
   The Jira ticket gives a minimal example with "maximally worst" behaviour - a 
tree that is pruned (outside the user's control) so that there are no splits at 
all.
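
   To illustrate why this matters for probability estimation, here is a minimal 
pure-Python sketch (not Spark code). It assumes the pruning in question merges 
sibling leaves that predict the same class; the `Leaf` class, the `merge` 
helper, and the counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """A hypothetical leaf holding training-row counts for a binary problem."""
    neg: int  # number of class-0 rows that reached this leaf
    pos: int  # number of class-1 rows that reached this leaf

    def prob_pos(self) -> float:
        # Estimated P(class 1) for any row routed to this leaf.
        return self.pos / (self.neg + self.pos)

    def predicted_class(self) -> int:
        # Hard prediction: the majority class in the leaf.
        return 1 if self.pos > self.neg else 0


def merge(left: Leaf, right: Leaf) -> Leaf:
    # Assumed pruning step: collapse two sibling leaves into one node.
    return Leaf(left.neg + right.neg, left.pos + right.pos)


left = Leaf(neg=90, pos=10)   # P(class 1) = 0.10
right = Leaf(neg=60, pos=40)  # P(class 1) = 0.40

# Both leaves predict class 0, so a rule that only compares hard
# predictions would treat the split between them as redundant.
assert left.predicted_class() == right.predicted_class() == 0

merged = merge(left, right)
print(left.prob_pos(), right.prob_pos(), merged.prob_pos())
```

   In this sketch the hard prediction is unchanged by the merge, but the two 
distinct probability estimates collapse into a single one, which is the kind 
of information loss that hurts probability estimation.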


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



