mgaido91 opened a new pull request #23773: [SPARK-26721][ML] Avoid per-tree normalization in featureImportance for GBT
URL: https://github.com/apache/spark/pull/23773

## What changes were proposed in this pull request?

Our feature importance calculation is taken from sklearn's, which was recently fixed (in https://github.com/scikit-learn/scikit-learn/pull/11176). Citing the description of that PR:

> Because the feature importances are (currently, by default) normalized and then averaged, feature importances from later stages are overweighted.

This PR applies a fix similar to sklearn's: the per-tree normalization of the feature importance is skipped for GBT.

Credit to Daniel Jumper for clearly pointing out the issue and the sklearn PR.

## How was this patch tested?

Modified the UT and checked that the `featureImportance` computed in that test is similar to sklearn's (it can't be identical, because the trees may be slightly different).
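To illustrate the overweighting described in the quoted sklearn PR, here is a minimal sketch in plain Python. It is not Spark's or sklearn's implementation: the function names, the `per_tree_normalization` flag, and the gain values are hypothetical, and tree weights are ignored for simplicity. It only contrasts normalizing each tree's gains before summing with summing raw gains and normalizing once at the end.

```python
def normalize(values):
    """Scale values so they sum to 1 (left unchanged if the sum is 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def importances(per_tree_gains, per_tree_normalization):
    """Aggregate per-feature gains over all trees (hypothetical helper).

    per_tree_gains: one list per tree, holding the raw gain
    attributed to each feature by that tree.
    """
    n_features = len(per_tree_gains[0])
    totals = [0.0] * n_features
    for gains in per_tree_gains:
        # Old behaviour: every tree is rescaled to sum to 1 first,
        # so a tree with tiny gains counts as much as a tree with large gains.
        contrib = normalize(gains) if per_tree_normalization else gains
        for i, g in enumerate(contrib):
            totals[i] += g
    # A single normalization at the end in both variants.
    return normalize(totals)


# Made-up gains for two features (assumption, for illustration only).
trees = [
    [90.0, 10.0],  # early tree: large gains, mostly from feature 0
    [0.1, 0.9],    # late tree: small residual gains, mostly from feature 1
]

with_norm = importances(trees, per_tree_normalization=True)
without_norm = importances(trees, per_tree_normalization=False)
print(with_norm, without_norm)
```

With per-tree normalization the two features end up with equal shares, even though the late tree contributes very little gain in absolute terms. Without it, feature 0 keeps roughly 0.89 of the total and feature 1 roughly 0.11, which reflects the actual gains.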
