huaxingao opened a new pull request #27954: [SPARK-31885][ML] Implement VarianceThresholdSelector
URL: https://github.com/apache/spark/pull/27954

### What changes were proposed in this pull request?
Implement a feature selector that removes all low-variance features. Features whose variance is not greater than the threshold are removed. The default is to keep all features with non-zero variance, i.e. to remove the features that have the same value in all samples.

### Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. The idea is that a feature that barely varies across samples generally has very little predictive power.

scikit-learn has implemented this selector:
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Added a new test suite.
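
For illustration only, the selection rule described in this pull request can be sketched in plain Python. This is not the Spark implementation: the function name `select_by_variance`, its parameters, and the sample data are hypothetical, and the choice of population variance (rather than sample variance) is an assumption made for the sketch. A feature is kept only when its variance is strictly greater than the threshold, which matches the stated default of dropping features that are constant across all samples.

```python
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Keep only the feature columns whose variance exceeds `threshold`.

    Hypothetical helper for illustration; not part of Spark or scikit-learn.
    `rows` is a list of equal-length feature lists, one per sample.
    Returns (kept_indices, filtered_rows).
    """
    if not rows:
        return [], []
    # Transpose so that each tuple holds all sample values of one feature.
    columns = list(zip(*rows))
    # Assumption: population variance; a real implementation may differ.
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    filtered = [[row[i] for i in kept] for row in rows]
    return kept, filtered


# Feature 0 is constant, feature 1 varies a lot, feature 2 varies slightly.
rows = [
    [1.0, 0.0, 3.0],
    [1.0, 2.0, 3.5],
    [1.0, 4.0, 3.0],
    [1.0, 6.0, 3.5],
]

kept_default, filtered_default = select_by_variance(rows)
kept_strict, filtered_strict = select_by_variance(rows, threshold=0.1)
```

With the default threshold of 0, only the constant feature (index 0) is dropped. Raising the threshold to 0.1 also drops feature 2, whose population variance is 0.0625.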
