137alpha commented on pull request #32813: URL: https://github.com/apache/spark/pull/32813#issuecomment-857184357
@srowen @asolimando Agreed that the ideal outcome would be to expose this as a parameter so the user can change it. I think the default behaviour here should be Prune = False, in order to comply with the standard behaviour expected from decision trees. As a less ideal alternative, the web documentation and the API documentation need a substantial warning that the default behaviour of Prune = True will give poor performance on unbalanced data sets and for probability estimation use cases.

As you observe @asolimando, changing the default behaviour to Prune = False would trigger an unexpected performance regression for users at prediction time.

> If possible, it would be great to either fix the current "optimization" by looking at more information than the class prediction (notably, the probability), or at least provide a user-facing parameter to control the behaviour, so who needs (2)/(3) can disable it, who is happy with just (1) can benefit from it.

> Right on @137alpha - would it be correct to say the pruning would be 'correct' if they had the same class probs? If that's the kind of thing that could make it work, OK. But I then wonder how much is prunable under that definition, rendering the process possibly far less useful for speedup.

Exactly right. You'll still get the current pruning behaviour in the trivial cases "probability = 0" and "probability = 1", and in some other cases where minInstancesPerNode is small. (E.g., if minInstancesPerNode = 3 then the terminal node probabilities can be 0, 1/3, 2/3 or 3/3 = 1, so you might get some speedup there. If minInstancesPerNode = 1 then terminal nodes containing only one data point will be pruned much as they are now.) But if minInstancesPerNode is >> 1 then in general there will be minimal speedup.
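
To make the difference between the two pruning criteria concrete, here is a minimal Python sketch. It is not Spark code: `Node`, `prune`, `by_label` and `by_prob` are hypothetical names, and merging siblings by summing their training counts is an assumption made only for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Optional


@dataclass
class Node:
    # Leaf: pos/total are training counts. Internal node: left/right are set.
    pos: int = 0
    total: int = 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    @property
    def prob(self) -> Fraction:
        # Estimated probability of the positive class (leaves only).
        return Fraction(self.pos, self.total)

    @property
    def label(self) -> int:
        # Hard class prediction derived from the probability.
        return int(self.prob > Fraction(1, 2))


def prune(node: Node, same: Callable[[Node, Node], bool]) -> Node:
    """Bottom-up: merge two sibling leaves whenever same(left, right) holds."""
    if node.is_leaf:
        return node
    left, right = prune(node.left, same), prune(node.right, same)
    if left.is_leaf and right.is_leaf and same(left, right):
        # Assumption: the merged leaf simply pools the children's counts.
        return Node(pos=left.pos + right.pos, total=left.total + right.total)
    return Node(left=left, right=right)


def by_label(a: Node, b: Node) -> bool:
    # Criterion that looks only at the predicted class.
    return a.label == b.label


def by_prob(a: Node, b: Node) -> bool:
    # Stricter criterion: the class probabilities must match exactly.
    return a.prob == b.prob


# Two sibling leaves that both predict class 0, with probabilities 1/10 and 4/10.
tree = Node(left=Node(pos=1, total=10), right=Node(pos=4, total=10))

merged = prune(tree, by_label)  # collapses to one leaf; the 1/10 vs 4/10 split is lost
kept = prune(tree, by_prob)     # split survives because the probabilities differ
```

Under `by_label` the two leaves collapse into one with a pooled probability of 1/4, so the distinction between 1/10 and 4/10 disappears from probability estimates. Under `by_prob` the split is kept, and only siblings with identical probabilities (for example two pure leaves) would still be merged.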
