137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-857129700
> I think too that exposing the parameter is the safest option, since the pruning leads to a (sometimes sensible) performance improvement and, at least for prediction tasks, **does not have any downside.**

@asolimando this is fundamentally not correct, and this assumption is precisely the root cause of this bug.

There are roughly three use cases for decision trees (and the random forests that derive from them):

1. Accurately predicting the class of a binary variable (0/1)
2. Creating a score (pseudo-probability) that is used to rank a set of data points on the likelihood that the dependent variable is a member of a class (typically the positive class), e.g. propensity models
3. "Probability estimation trees": accurately estimating the probability that a data point is a member of a certain class

Use case 3 is facilitated by random forest models, which are provably convergent probability estimators; see Biau et al. (2008), "Consistency of random forests and other averaging classifiers", _Journal of Machine Learning Research_, 9, 2015-2033.

As described in https://issues.apache.org/jira/browse/SPARK-3159?focusedCommentId=17115343&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17115343, the logic of the pruning code only looks at the class prediction, not the probability of the node. The code assumes that merging nodes whose class prediction is the same is safe, regardless of the fact that those nodes may predict different probabilities. Use cases 2 and 3 above are radically affected by this bug, which derives from the assumption that the class prediction is the only thing of interest. The more imbalanced the data set, the worse the problem is, as the examples on the Jira ticket illustrate (see also the sketch at the end of this comment).

It is no answer to say "just rebalance the classes", because in use case 3 accurate probabilities are precisely the goal of the problem, and trees trained on class-rebalanced data do not give correct probability predictions without additional post hoc processing.

Yes, there is a reduction in run time from this pruning, but there is no benefit at training time, and it is no good having a reduction in prediction time if the answer is wrong. As per the example on the Spark Jira, in a fairly simple case the resulting DecisionTreeClassifier is not usable. So the performance improvement here might be characterised as "fast but wrong".

**Some extra context from my personal experience here**

I am a partner at BCG Gamma, one of the world's largest employers of data scientists and data engineers. (We have around 1000 data scientists and data engineers working on advanced analytics problems with clients.) Multiple clients in my personal experience have been significantly affected by this. Comments across multiple client situations include the following:

* I had thought that I could get accurate probability estimates from the Spark RandomForestClassifier, but this turns out not to be true
* I use decision trees to gain insight into churn propensity patterns. To do this, I build a decision tree and look at the tree structure. But I have encountered examples where I cannot even get the tree to create a single split, rendering Spark totally useless for this purpose. I am forced to use H2O instead to handle a dataset of this size. This is highly undesirable, because my infrastructure team will not officially support it due to issues it creates on the Spark cluster
* I am unable to obtain reasonable scores from the RandomForestClassifier on my data. I am forced to use GBTClassifier (which is unaffected by this)
* I have spent hours training a model, only to find that the resultant model is pruned via a non-standard process that is outside my control. Whilst there might be a speed-up at prediction time, this is no help to me when the model is unusable
* We noticed that sometimes the RandomForestClassifier gave very poor results, but we didn't look into it further; we just used GBTClassifier

After reaching out through my network, I have identified other people who have also had poor experiences using the DecisionTreeClassifier and RandomForestClassifier for use cases (2) and (3) above, but who could not work out why, given that the Spark documentation is silent on this, and who were grateful to be alerted to the undocumented pruning behaviour. Some of my clients have concluded that DecisionTreeClassifier and RandomForestClassifier are functionally broken and should not be used under any circumstances until this is fixed.

I'm sorry if this all sounds very negative, but this is a much bigger problem than I think you are grasping at this stage. I am happy to get on a phone call to discuss this further if that's helpful.
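To make the failure mode concrete, here is a minimal sketch of the kind of collapse described on the Jira ticket. It uses PySpark with made-up toy data (the column names, counts and rates are illustrative, not taken from any client case): two groups whose positive rates are ~1% and ~20%. Because both rates are below 50%, every leaf predicts class 0, so the pruning merges the children back into the root and the model assigns one blended probability to everything.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.getOrCreate()

# Toy imbalanced data (illustrative numbers): group x=0.0 is ~1% positive,
# group x=1.0 is ~20% positive. Both rates are below 50%, so every leaf
# predicts class 0 even though the probabilities differ by a factor of 20.
rows = ([(0.0, 1.0 if i < 2 else 0.0) for i in range(200)]
        + [(1.0, 1.0 if i < 40 else 0.0) for i in range(200)])
df = spark.createDataFrame(rows, ["x", "label"])
df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = DecisionTreeClassifier(featuresCol="features", labelCol="label").fit(df)

# The split on x has positive information gain, so the tree is grown; but
# because both children predict class 0, the pruning merges them away and
# the model degenerates to a root-only stump that predicts ~0.105 for every
# point instead of ~0.01 and ~0.20.
print(model.depth)          # 0 with the pruning behaviour in place
print(model.toDebugString)
```

With the pruning disabled (which is what exposing the parameter in this PR would allow), the same data should give a depth-1 tree whose two leaves carry the ~1% and ~20% probabilities, which is exactly what use cases 2 and 3 need.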
