asolimando commented on pull request #32813: URL: https://github.com/apache/spark/pull/32813#issuecomment-857172105
@137alpha thanks for the detailed explanation, now it's clear what you meant (I had missed the example from "xujiajin").

My comment was exclusively about cases revolving around (1), which is what I was working on when I contributed the PR (performance improvement of classification tasks, and rule extraction from DecisionTrees and RandomForests). For (2) and (3) this indeed seems problematic, because we disregard the probability entirely.

If possible, it would be great to either fix the current "optimization" by looking at more information than the class prediction (notably, the probability), or at least provide a user-facing parameter to control the behaviour, so that those who need (2)/(3) can disable it and those who are happy with just (1) can still benefit from it.

Regarding the documentation update, at the time it did not seem relevant, because the contribution seemed to be an internal optimization (that is, an iso-functional improvement). It's probably a good idea to add a comment describing the behaviour of the controlling parameter proposed by @srowen.

As a closing remark, I understand that this has caused some issues and frustration for some people, including yourself, but sometimes, when trying to make things better (maybe by volunteering in our spare time, as was the case for me with this PR), we can cause other issues, which can in turn be tackled and hopefully solved. That's the beauty of OSS.
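
To make the concern about (2)/(3) concrete, here is a minimal toy sketch in plain Python. It is not Spark's actual implementation: `Node`, `prune`, and the `enabled` flag are hypothetical names invented for this illustration, and the probabilities are made-up numbers. It only shows how merging sibling leaves that agree on the predicted class keeps the class prediction intact while discarding the per-leaf probabilities, and how a flag could switch that behaviour off.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Toy tree node (hypothetical, not a Spark class)."""
    probs: Tuple[float, ...]          # class probabilities at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    @property
    def prediction(self) -> int:
        # Index of the most probable class.
        return max(range(len(self.probs)), key=self.probs.__getitem__)


def prune(node: Node, enabled: bool = True) -> Node:
    """Collapse a split whose two leaf children predict the same class.

    `enabled` stands in for the kind of user-facing switch discussed
    above; with enabled=False the tree is returned unchanged.
    """
    if not enabled or node.is_leaf:
        return node
    left = prune(node.left, enabled)
    right = prune(node.right, enabled)
    if left.is_leaf and right.is_leaf and left.prediction == right.prediction:
        # Same predicted class on both sides: the split is dropped and
        # only the parent's probabilities survive.
        return Node(node.probs)
    return Node(node.probs, left, right)


# Both leaves predict class 1, but with different confidence.
tree = Node(
    probs=(0.25, 0.75),
    left=Node(probs=(0.4, 0.6)),
    right=Node(probs=(0.1, 0.9)),
)

pruned = prune(tree)                  # split is merged away
kept = prune(tree, enabled=False)     # split is preserved

print(pruned.is_leaf, pruned.prediction, pruned.probs)
print(kept.is_leaf, kept.right.probs)
```

In this toy case the pruned tree still predicts class 1 everywhere, which is all that (1) needs, but a sample that would have reached the right leaf no longer sees its own leaf's probabilities. Comparing the children's probabilities as well as their predictions before merging would be the other option mentioned above.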
