137alpha commented on pull request #32813:
URL: https://github.com/apache/spark/pull/32813#issuecomment-857184357


   @srowen  @asolimando Agreed that the ideal outcome would be to expose this 
as a parameter so the user can change this.
   
   I think the default behaviour here should be Prune = False, in order to be 
compliant with the standard behaviour expected from decision trees. As a less 
ideal alternative, the web documentation and the API documentation need a 
substantial warning that the default behaviour of Prune = True will give poor 
results on unbalanced data sets and for probability estimation use cases.
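
   To make the concern concrete, here is a minimal sketch in plain Python (not 
Spark's API; the leaf counts are made up for illustration). Two sibling leaves 
both predict the majority class, so pruning on the predicted class alone merges 
them, and only the pooled probability survives:

```python
from fractions import Fraction

# Hypothetical sibling leaves, as (negative, positive) training counts.
left = (990, 10)   # P(positive) = 1%
right = (70, 30)   # P(positive) = 30%

def predicted_class(counts):
    """Majority-class prediction for a binary leaf."""
    neg, pos = counts
    return 1 if pos > neg else 0

def positive_prob(counts):
    """Exact fraction of positive instances in a leaf."""
    neg, pos = counts
    return Fraction(pos, neg + pos)

# Both leaves predict class 0, so prediction-based pruning would merge them.
same_prediction = predicted_class(left) == predicted_class(right)

# After the merge, only the pooled estimate is available.
merged = (left[0] + right[0], left[1] + right[1])

print(same_prediction)         # True
print(positive_prob(left))     # 1/100
print(positive_prob(right))    # 3/10
print(positive_prob(merged))   # 2/55
```

   The class prediction is unchanged by the merge, but the 1% and 30% estimates 
collapse into a single value of roughly 3.6%.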
   
   As you observe, @asolimando, changing the default behaviour to Prune = False 
would trigger an unexpected speed regression for users at prediction time.
   
   > If possible, it would be great to either fix the current "optimization" by 
looking at more information than the class prediction (notably, the 
probability), or at least provide a user-facing parameter to control the 
behaviour, so who needs (2)/(3) can disable it, who is happy with just (1) can 
benefit from it.
   
   > Right on @137alpha - would it be correct to say the pruning would be 
'correct' if they had the same class probs? If that's the kind of thing that 
could make it work, OK. But I then wonder how much is prunable under that 
definition, rendering the process possibly far less useful for speedup.
   
   Exactly right. You'll still get the current pruning behaviour in the trivial 
cases "probability = 0" and "probability = 1", and in some other cases where 
minInstancesPerNode is small. (E.g., if you have minInstancesPerNode = 3 then 
the terminal node probabilities can be 0, 1/3, 2/3 or 3/3 = 1, so you might get 
some speedup there. If you have minInstancesPerNode = 1 then terminal nodes 
containing only one data point will be pruned much as they are currently.)
   
   But if minInstancesPerNode is >> 1 then in general there will be minimal 
speedup. 
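
   As a rough sketch of that stricter criterion (plain Python, not Spark code; 
both function names are hypothetical), sibling leaves would be merged only when 
their class probabilities match exactly:

```python
from fractions import Fraction

def leaf_probabilities(n):
    """Distinct positive-class fractions for a binary leaf with exactly n instances."""
    return [Fraction(k, n) for k in range(n + 1)]

def can_prune_by_probability(left, right):
    """Merge two sibling leaves only if their class probabilities are equal.

    Each leaf is given as (negative, positive) training counts.
    """
    (ln, lp), (rn, rp) = left, right
    return Fraction(lp, ln + lp) == Fraction(rp, rn + rp)

print([str(p) for p in leaf_probabilities(3)])        # ['0', '1/3', '2/3', '1']
print(can_prune_by_probability((3, 0), (5, 0)))       # True: both probability 0
print(can_prune_by_probability((2, 1), (4, 2)))       # True: both 1/3
print(can_prune_by_probability((990, 10), (70, 30)))  # False: 1/100 vs 3/10
```

   With larger leaves the number of possible probability values grows, so exact 
matches between siblings become correspondingly rarer.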


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


