srowen commented on pull request #32813: URL: https://github.com/apache/spark/pull/32813#issuecomment-857176436
Yeah, great explanation. I'm still remembering how all this code works, so I'm probably asking dumb questions. Is the problem that the leaves' impurity stats are not combined, but just use the parent node's? Or is that also not quite the point?

Where do you get class probabilities out of the API, or are you reaching into the model to figure that out? Sorry if I've just forgotten that possibility in the API, but I didn't recall or see it. I'm just trying to trace back how probability connects to the LeafNodes -- via impurity, right? Your example doesn't seem to retrieve probabilities.

The scikit tree shows a lot of "redundant" decision nodes, but if they're redundant, I wonder what else is stored, and thus what we need to look at when deciding whether to prune in Spark.

I think this does indeed affect a certain type of use case -- hardly fully broken, but I do believe you that there's a problem of some size; no need to collect more evidence! The simplest fix I would definitely support is making this, at least, _optional_, rather than disabling it by default. On the flip side, huge forests can be an issue to load, too. I'm still hoping there's a fix or a misunderstanding somewhere we can save this with, but maybe there is nothing reasonable, and correctness is incompatible with what you're doing.
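To make the probability question concrete, here is a minimal pure-Python sketch (not Spark's or scikit-learn's actual code; all names are hypothetical) of why merging sibling leaves that share the same hard prediction still changes the class probabilities: the merged leaf reports the parent's pooled impurity stats instead of each child's own.

```python
# Hypothetical illustration: class probabilities are derived from a
# leaf's impurity stats (per-class example counts). Two sibling leaves
# can agree on the argmax prediction yet report different probabilities,
# so pruning them into one leaf loses information.

def probs(counts):
    """Class probabilities from per-class counts at a node."""
    total = sum(counts)
    return [c / total for c in counts]

left = [90, 10]    # leaf A: predicts class 0 with P(0) = 0.90
right = [60, 40]   # leaf B: predicts class 0 with P(0) = 0.60
parent = [l + r for l, r in zip(left, right)]  # merged: [150, 50]

# Prediction-only view: both children look "redundant" (same argmax).
# Probability view: rows that reached leaf A now get P(0) = 0.75
# instead of 0.90, and rows from leaf B get 0.75 instead of 0.60.
print(probs(left))    # [0.9, 0.1]
print(probs(right))   # [0.6, 0.4]
print(probs(parent))  # [0.75, 0.25]
```

So the split is redundant for `prediction` but not for `probability`, which is the use case the pruning behavior would affect.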
