Hi there!

The http://scikit-learn.org homepage recommends posting on this mailing list before making major contributions, so here goes:
sklearn.preprocessing currently offers both a scale() function and a StandardScaler transformer, as well as a MinMaxScaler. I'd like to add a `RobustScaler`, which works just like the StandardScaler but uses the median for centering and the interquartile range for scaling, both of which are more robust to outliers than the mean and standard deviation. In my own work I often deal with noisy data where such a robust normalization typically gives better results. (Also, it can be shown that e.g. the sample median is a better estimate of the population mean than the sample mean if the data is Laplacian distributed, so there's that, too.) But I don't know if there's enough general interest in this for me to add it.

I have a version of this already working in my private code, but while browsing the sklearn source I noticed that StandardScaler and MinMaxScaler (and the scale() function) contain a lot of duplicated code. Thus I'd like to introduce a common base class so that Standard/MinMax/RobustScaler all share common code for dealing with sparse matrices and other parameters (with_mean, copy, ...). The classes themselves would then differ only in how they estimate the centering/scaling statistics. This would also fix the current inconsistency where e.g. MinMaxScaler destroys sparsity when given sparse input, while StandardScaler takes extra care not to.

However, the cleanest way to do all this would be to rename some of the attributes and parameters, which are currently named quite inconsistently. E.g. MinMaxScaler has an attribute `scale_`, while StandardScaler uses `std_` to store its scaling statistics. I'd thus propose to introduce a `BaseScaler` with options `with_centering` and `with_scaling` and attributes `center_` and `scale_`, and to derive the other scalers from it. It might also make sense to have another option, `axis`, that lets the user choose along which axis to scale/normalize (similar to what the scale() function does).
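
To make the proposed split concrete, here is a rough sketch of how a shared base class and its subclasses could fit together. It is only an illustration under simplifying assumptions: dense NumPy input, column-wise statistics, no `axis` or `copy` handling, and no sparse support. The `_estimate` hook is a hypothetical name invented for this sketch, and the classes below are stand-ins, not the existing sklearn implementations.

```python
import numpy as np


class BaseScaler:
    """Hypothetical shared base class (sketch only, dense input only)."""

    def __init__(self, with_centering=True, with_scaling=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling

    def _estimate(self, X):
        # Hypothetical hook: subclasses return (center, scale) per column.
        raise NotImplementedError

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        center, scale = self._estimate(X)
        self.center_ = center if self.with_centering else None
        if self.with_scaling:
            scale = np.array(scale, dtype=float)
            # Avoid division by zero for constant columns.
            scale[scale == 0.0] = 1.0
            self.scale_ = scale
        else:
            self.scale_ = None
        return self

    def transform(self, X):
        # Always work on a float copy so the caller's data is untouched.
        X = np.array(X, dtype=float)
        if self.center_ is not None:
            X -= self.center_
        if self.scale_ is not None:
            X /= self.scale_
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


class StandardScaler(BaseScaler):
    """Stand-in for the mean / standard deviation variant."""

    def _estimate(self, X):
        return X.mean(axis=0), X.std(axis=0)


class RobustScaler(BaseScaler):
    """Median for centering, interquartile range for scaling."""

    def _estimate(self, X):
        q25, q50, q75 = np.percentile(X, [25, 50, 75], axis=0)
        return q50, q75 - q25


if __name__ == "__main__":
    # One gross outlier: the median/IQR statistics barely notice it.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
    robust = RobustScaler().fit(X)
    print(robust.center_, robust.scale_)
    print(robust.transform(X).ravel())
```

With this layout each scaler only supplies its own statistics, while the `with_centering`/`with_scaling` handling and the `center_`/`scale_` attributes live in one place. Sparse-matrix handling would go into the base class as well, but is left out here.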
Of course, the old attribute names would have to be deprecated and then removed a few releases later.

Additionally, I'd like to add a `robust_scale` function, analogous to the `scale` function. Both of these should internally use the RobustScaler/StandardScaler classes, since right now there is a lot of duplicated code between StandardScaler and scale() for no good reason.

So, is there any interest in these modifications/enhancements?

Cheers,
Thomas
