[Scikit-learn-general] sklearn.preprocessing: robust scaling and general refactoring of scaling functionality

Thomas Unterthiner Thu, 03 Oct 2013 06:59:12 -0700

Hi there!

The http://scikit-learn.org homepage recommends posting on this mailing 
list before making major contributions, so here goes:



sklearn.preprocessing currently offers both a scale() function and a 
StandardScaler transformer, as well as a MinMaxScaler.

I'd like to add a `RobustScaler`, which works just like the 
StandardScaler but uses the median for centering and the interquartile 
range for scaling, both of which are more robust to outliers than the 
mean and standard deviation. In my own work I often deal with noisy 
data where such a robust normalization typically gives better results. 
(Also, it can be shown that the sample median is a better, i.e. 
lower-variance, estimate of the population mean than the sample mean if 
the data is Laplace-distributed, so there's that, too.) But I don't know 
if there's enough general interest in this for me to add it.
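
To make the idea concrete, here is a minimal NumPy sketch of median/IQR 
scaling next to mean/std scaling. The function name 
`robust_scale_sketch` and the toy data are made up for illustration, 
and this only handles dense input; it is not the code I would submit.

```python
import numpy as np


def robust_scale_sketch(X):
    """Center each column by its median and divide by its interquartile range."""
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    scale = q75 - q25
    # Avoid dividing by zero for columns whose IQR is zero.
    scale = np.where(scale == 0, 1.0, scale)
    return (X - center) / scale, center, scale


# Four ordinary values and one gross outlier in a single feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_robust, center, scale = robust_scale_sketch(X)

# Mean/std scaling for comparison: the outlier inflates both statistics.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_robust.ravel())
print(X_standard.ravel())
```

With mean/std scaling the single outlier dominates both statistics, so 
the four ordinary values are squeezed into a very narrow band; with 
median/IQR scaling they keep a usable spread and only the outlier ends 
up far from zero.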

I have a version of this already working in my private code, but while I 
was browsing the sklearn source I noticed that StandardScaler and 
MinMaxScaler (and the scale() function) contain a lot of duplicated 
code. Thus I'd like to introduce a common base class so that 
Standard/MinMax/RobustScaler all share the code that deals with sparse 
matrices and common parameters (with_mean, copy, ...). The classes 
themselves would then differ only in how they estimate the 
centering/scaling statistics. This would also fix the inconsistency that 
e.g. MinMaxScaler destroys sparsity when given sparse input, while 
StandardScaler takes extra care not to.

However, the cleanest way to do all this would be to rename some of the 
attributes and parameters, which are currently named quite 
inconsistently. E.g. MinMaxScaler has an attribute `scale_`, while 
StandardScaler uses `std_` to store its scaling statistics. I'd thus 
propose to introduce a `BaseScaler` with options `with_centering` and 
`with_scaling` and with attributes `center_` and `scale_`, and derive 
the other scalers from this. It might also make sense to have another 
option `axis` that selects the axis along which to scale/normalize 
(similar to what the scale() function does). Of course, the old 
attribute names would have to be deprecated and then removed a few 
releases later.
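
A rough sketch of what such a hierarchy could look like, in plain NumPy 
and for dense input only. `BaseScaler`, `with_centering`, 
`with_scaling`, `center_` and `scale_` are the names proposed above; 
the `_estimate` hook and the `*Sketch` class names are placeholders I 
made up for this example, and sparse handling, `copy` and `axis` are 
left out entirely.

```python
import numpy as np


class BaseScaler:
    """Shared fit/transform logic; subclasses only supply the statistics."""

    def __init__(self, with_centering=True, with_scaling=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling

    def _estimate(self, X):
        # Hypothetical hook: return (center, scale), one value per column.
        raise NotImplementedError

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        center, scale = self._estimate(X)
        self.center_ = center if self.with_centering else None
        if self.with_scaling:
            # Guard against division by zero for constant columns.
            self.scale_ = np.where(scale == 0, 1.0, scale)
        else:
            self.scale_ = None
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.center_ is not None:
            X = X - self.center_
        if self.scale_ is not None:
            X = X / self.scale_
        return X


class StandardScalerSketch(BaseScaler):
    def _estimate(self, X):
        return X.mean(axis=0), X.std(axis=0)


class RobustScalerSketch(BaseScaler):
    def _estimate(self, X):
        q25, q75 = np.percentile(X, [25, 75], axis=0)
        return np.median(X, axis=0), q75 - q25


class MinMaxScalerSketch(BaseScaler):
    def _estimate(self, X):
        # Center on the column minimum and divide by the range: maps to [0, 1].
        col_min = X.min(axis=0)
        return col_min, X.max(axis=0) - col_min
```

The point of the design is that each concrete scaler shrinks to a single 
method that returns two arrays, so any fix to the shared logic (sparse 
input, deprecation of old attribute names, ...) only has to be made once.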

Additionally, I'd like to add a `robust_scale` function, analogous to 
the `scale` function. Both of these should internally use the 
Robust/StandardScaler classes, as right now there is a lot of duplicated 
code between StandardScaler and scale for no good reason.
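
As an illustration of the function form delegating to a class, a small 
sketch follows. `_MedianIQRScaler` is only a stand-in for the proposed 
RobustScaler so that the example runs on its own, and the signature of 
`robust_scale` is an assumption; I'm assuming the same `axis` convention 
as scale(), where axis=0 treats each column (feature) independently and 
axis=1 each row (sample).

```python
import numpy as np


class _MedianIQRScaler:
    """Minimal stand-in for the proposed RobustScaler (dense input only)."""

    def fit(self, X):
        q25, q75 = np.percentile(X, [25, 75], axis=0)
        self.center_ = np.median(X, axis=0)
        iqr = q75 - q25
        # Leave columns with zero IQR unscaled instead of dividing by zero.
        self.scale_ = np.where(iqr == 0, 1.0, iqr)
        return self

    def transform(self, X):
        return (X - self.center_) / self.scale_


def robust_scale(X, axis=0):
    """Scale along `axis` by delegating to the scaler class."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("expected a 2-D array")
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    if axis == 1:
        # Row-wise scaling is column-wise scaling of the transpose.
        return _MedianIQRScaler().fit(X.T).transform(X.T).T
    return _MedianIQRScaler().fit(X).transform(X)


row = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
print(robust_scale(row, axis=1))
print(robust_scale(row.T, axis=0).ravel())
```

Written this way the function contains no statistics of its own, which 
is exactly the duplication between scale() and StandardScaler that I'd 
like to remove.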

So, would there be any interest in these modifications/enhancements?

Cheers


Thomas
