137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-857129700


   > I think too that exposing the parameter is the safest option, since the
   > pruning leads to a (sometimes sensible) performance improvement and, at
   > least for prediction tasks, **does not have any downside.**
   
   @asolimando this is fundamentally not correct, and this assumption is 
precisely the root cause of this bug.
   
   There are roughly three use cases for decision trees (and random forests 
that derive from them):
   
   1. Accurately predicting the class of a binary variable (0/1)
   2. Creating a score (pseudo-probability) used to rank a set of data points
   by the likelihood that the dependent variable is a member of a class
   (typically the positive class), e.g. propensity models
   3. "Probability estimation trees" - Accurately estimating the probability 
that a data point is a member of a certain class.
   
   Use case 3 is facilitated by random forest models, which are provably
   consistent probability estimators - see Biau et al. (2008), "Consistency of
   random forests and other averaging classifiers", _Journal of Machine
   Learning Research_, 9, 2015-2033.
   
   As described in
https://issues.apache.org/jira/browse/SPARK-3159?focusedCommentId=17115343&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17115343,
   the logic of the pruning code looks only at the class prediction of each
   node, not at the probability the node predicts.
   
   The code assumes that "merging nodes where the class prediction is the same
   is safe", disregarding the fact that those nodes may predict different
   probabilities.
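
   To make the flaw concrete, here is a hedged sketch of the shape of that
   decision (illustrative names and simplified types, not the actual Spark
   internals): the merge test compares only the predicted class labels and
   never consults the per-class counts that determine the probabilities.

   ```scala
   // Illustrative sketch only - not the real Spark pruning code.
   case class Leaf(prediction: Double, classCounts: Array[Double])

   // The pruning asks, in effect: "do both children predict the same class?"
   def safeToMerge(left: Leaf, right: Leaf): Boolean =
     left.prediction == right.prediction
   // classCounts - and hence the predicted probabilities, e.g. 0.1 vs 0.3
   // for the positive class - are never compared, so merging discards them.
   ```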
   
   Use cases 2 and 3 above are radically affected by this bug, which derives 
from the assumption that the class prediction is the only thing of interest. 
   
   The more imbalanced the data set, the worse the problem is, as the examples
   on the Jira ticket illustrate. "Just rebalance the classes" is no answer,
   because in use case 3 accurate probabilities are precisely the goal, and
   trees trained on class-rebalanced data do not give correct probability
   predictions without additional post hoc processing.
   
   Yes, there is a reduction in run time from this pruning, but there is no
   benefit at training time, and it's no good having a reduction in prediction
   time if the answer is wrong. As per the example on the Spark Jira, even in
   a fairly simple case the resulting DecisionTreeClassifier is not usable.
   So, the performance improvement here might be characterised as "fast but
   wrong".
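
   To see this end to end, below is a minimal sketch in the spirit of the Jira
   example (hypothetical data, not the exact reproduction from the ticket):
   two regions of the feature space with positive rates of 10% and 30%. Both
   regions have majority class 0, so the split separating them is pruned away
   and the fitted tree is a single leaf, even though the two class
   probabilities differ by a factor of three.

   ```scala
   import org.apache.spark.ml.classification.DecisionTreeClassifier
   import org.apache.spark.ml.linalg.Vectors
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder
     .master("local[*]").appName("pruning-repro").getOrCreate()
   import spark.implicits._

   // x = 0.0 -> 10% positives; x = 1.0 -> 30% positives.
   // Both regions are majority class 0, but P(y=1|x) differs threefold.
   val rows =
     Seq.tabulate(1000)(i => (Vectors.dense(0.0), if (i < 100) 1.0 else 0.0)) ++
     Seq.tabulate(1000)(i => (Vectors.dense(1.0), if (i < 300) 1.0 else 0.0))
   val df = rows.toDF("features", "label")

   val model = new DecisionTreeClassifier().setMaxDepth(2).fit(df)

   // With the pruning in place this prints a depth-0, single-leaf tree: the
   // split on x is discarded because both children predict class 0.0, and
   // the 0.1 vs 0.3 probability information is lost.
   println(model.toDebugString)
   println(s"numNodes=${model.numNodes}, depth=${model.depth}")
   ```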
   
   **Some extra context from my personal experience here**
   
   I am a partner at BCG Gamma, one of the world's largest employers of data
   scientists and data engineers (we have around 1,000 data scientists and
   data engineers working on advanced analytics problems with clients).
   
   In my personal experience, I have multiple clients who have been
   significantly affected by this.
   
   Comments across multiple client situations include the following:
   
   * I had thought that I could get accurate probability estimates from the
   Spark RandomForestClassifier, but this turns out not to be true
   * I use decision trees to gain insight into churn propensity patterns. To
   do this, I build a decision tree and look at the tree structure. But I have
   encountered examples where I cannot even get the tree to create a single
   split, rendering Spark totally useless for this purpose. I am forced to use
   H2O instead to handle a dataset of this size, which is highly undesirable
   because my infrastructure team will not officially support it due to the
   issues it creates on the Spark cluster
   * I am unable to obtain reasonable scores from the use of the 
RandomForestClassifier on my data. I am forced to use GBTClassifier (which is 
unaffected by this)
   * I have spent hours training a model and then find that the resultant model 
is pruned via a non-standard process that is outside my control. Whilst there 
might be a speed-up at prediction time, this is no help to me when the model is 
unusable.
   * We noticed that sometimes the RandomForestClassifier gave very poor
   results, but didn't look into it further; we just used GBTClassifier
   
   After reaching out through my network, I have identified other people who
   have also had poor experiences using the DecisionTreeClassifier and
   RandomForestClassifier for use cases (2) and (3) above but couldn't work
   out why, given that the Spark documentation is silent on this, and who were
   grateful to be alerted to the undocumented pruning behaviour.
   
   Some of my clients have concluded that DecisionTreeClassifier and 
RandomForestClassifier are functionally broken and should not be used under any 
circumstances until this is fixed.
   
   I'm sorry if this all sounds very negative, but this is a much bigger 
problem than I think you are grasping at this stage. I am happy to get on a 
phone call to discuss this further if that's helpful.
   
   

