Re: [Scikit-learn-general] Dramatic improvement by standardizing data?

Fabrizio Fasano Thu, 30 Apr 2015 00:51:07 -0700

Thank you very much Sebastian,

I was very surprised by such a significant improvement, but if it is 
"not uncommon" for the reasons you explained, that’s great.


Best,

Fabrizio


> On 29 Apr 2015, at 17:40, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> Hi, Fabrizio,
> 
> Sure, it makes perfect sense to standardize your data if you are using 
> models such as linear SVMs, logistic regression, etc. In fact, I can only 
> think of decision trees and random forests as cases where standardization 
> may be redundant.
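
As an illustration of why threshold-based models are insensitive to scaling (a minimal sketch, not part of the original messages): a one-split "decision stump" is fitted on raw and on standardized toy data. The helpers `standardize`, `fit_stump` and `predict_stump` and the data are hypothetical, written only for this example. Because standardization preserves the ordering of values within each feature, the stump should choose the same feature and produce the same predictions in both cases.

```python
def standardize(X):
    """Center each column to mean 0 and scale it to unit (population) variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def fit_stump(X, y):
    """Try every midpoint threshold on every feature; keep the split with fewest errors."""
    best = None  # (errors, feature index, threshold, label assigned to the left side)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            for label_left in (0, 1):
                errors = sum(
                    (label_left if row[j] <= threshold else 1 - label_left) != yi
                    for row, yi in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, label_left)
    return best[1:]


def predict_stump(stump, X):
    """Apply a fitted (feature, threshold, label_left) stump to every row."""
    j, threshold, label_left = stump
    return [label_left if row[j] <= threshold else 1 - label_left for row in X]


# Made-up toy data: feature 0 spans roughly 1-10, feature 1 roughly 1-10000.
X = [[2.0, 9000.0], [3.5, 150.0], [1.0, 7200.0], [9.0, 300.0],
     [8.5, 8100.0], [4.0, 50.0], [7.0, 6400.0], [6.0, 900.0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

stump_raw = fit_stump(X, y)
X_std = standardize(X)
stump_std = fit_stump(X_std, y)

pred_raw = predict_stump(stump_raw, X)
pred_std = predict_stump(stump_std, X_std)
print("same feature chosen:", stump_raw[0] == stump_std[0])
print("same predictions:   ", pred_raw == pred_std)
```

Only the threshold value changes between the two fits; the partition of the samples does not.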
> 
> Standardization will center your data and bring the features onto a similar 
> scale. Imagine you learn the weights via gradient descent and you have two 
> features, the first in the range 1-10 and the second in the range 1-10000. 
> Then the learning algorithm will mostly be busy updating the weight of 
> feature 2, because its contribution to the cost (e.g., picture a simple 
> sum-of-squared-errors cost function in linear regression) will tend to be 
> much larger. Also, mean centering may be important for optimal behavior, 
> since the weights are typically initialized to 0 or small random values in 
> most implementations.
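
The gradient-descent argument can be tried out with a small experiment (a minimal sketch, not part of the original messages). It assumes a synthetic linear-regression problem with the two feature ranges mentioned above, plain batch gradient descent with weights initialized to 0, and hand-picked learning rates; all names and constants are illustrative. On the raw features the step size has to be tiny to stay stable along feature 2, so the weight of feature 1 hardly moves. After standardization a much larger step size can be used.

```python
import random


def standardize(X):
    """Center each column to mean 0 and scale it to unit (population) variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def gradient_descent(X, y, lr, n_iter):
    """Batch gradient descent on the mean squared error; weights and bias start at 0."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def mse(X, y, w, b):
    """Mean squared error of the linear model (w, b) on (X, y)."""
    return sum((sum(wj * xj for wj, xj in zip(w, xi)) + b - yi) ** 2
               for xi, yi in zip(X, y)) / len(X)


# Synthetic data (assumption): feature 1 in 1-10, feature 2 in 1-10000, noise-free target.
rng = random.Random(0)
X_raw = [[rng.uniform(1, 10), rng.uniform(1, 10000)] for _ in range(50)]
y = [3.0 * x1 + 0.002 * x2 + 1.0 for x1, x2 in X_raw]

# Raw features: lr is limited by the scale of feature 2.
w_raw, b_raw = gradient_descent(X_raw, y, lr=1e-8, n_iter=1000)
mse_raw = mse(X_raw, y, w_raw, b_raw)

# Standardized features: both directions are on a similar scale.
X_std = standardize(X_raw)
w_std, b_std = gradient_descent(X_std, y, lr=0.1, n_iter=1000)
mse_std = mse(X_std, y, w_std, b_std)

print("raw features:          weights", w_raw, "MSE", mse_raw)
print("standardized features: weights", w_std, "MSE", mse_std)
```

The learning rates are chosen by hand for this toy problem; an optimizer with per-feature step sizes would behave differently.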
> 
> Best,
> Sebastian 
> 
>> On Apr 29, 2015, at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
>> 
>> Dear experts,
>> 
>> I’m seeing a dramatic improvement in cross-validation accuracy when the 
>> data are standardised: accuracy increased from 48% to 100% when I switched 
>> from X to X_scaled = preprocessing.scale(X).
>> 
>> Does it make sense in your opinion?
>> 
>> Thanks a lot for any suggestions,
>> 
>> Fabrizio
>> 
>> 
>> 
>> my CODE:
>> 
>> import numpy as np
>> from sklearn import preprocessing
>> from sklearn.svm import LinearSVC
>> from sklearn.cross_validation import StratifiedShuffleSplit
>> 
>> # 14 features, 16 samples dataset; column 0 holds the class labels
>> data = np.loadtxt("data.txt")
>> y = data[:, 0]
>> X = data[:, 1:15]
>> # Standardize every feature to zero mean and unit variance
>> X_scaled = preprocessing.scale(X)
>> 
>> # 10000 stratified random splits, 25% of the samples held out each time
>> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
>> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
>> cv_scores = []
>> 
>> for train_index, test_index in sss:
>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>     clf.fit(X_train, y_train)
>>     y_pred = clf.predict(X_test)
>>     # Fraction of correctly classified test samples in this split
>>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
>> 
>> # Mean accuracy and two standard deviations, both in percent (rounded up)
>> print("Accuracy %s +/- %s" % (np.ceil(100 * np.mean(cv_scores)),
>>                               np.ceil(200 * np.std(cv_scores))))
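
For readers on a current scikit-learn release, where `sklearn.cross_validation` has been replaced by `sklearn.model_selection`, the comparison could be sketched roughly as follows (not part of the original messages). The data here are a synthetic stand-in, since `data.txt` is not available, and the number of splits is reduced to 100 to keep the run short. Putting `StandardScaler` inside a pipeline, so that it is fitted on each training split only, is a design choice of this sketch rather than something discussed in the thread.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for data.txt (assumption): 16 samples, 14 features
# whose scales differ by up to four orders of magnitude.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 8)
scales = 10.0 ** rng.uniform(0, 4, size=14)
X = rng.normal(size=(16, 14)) * scales
X[:, 0] += 2.0 * scales[0] * y  # shift class 1 along feature 0

# Same splitting scheme as in the original code, with fewer repetitions.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
svc = LinearSVC(penalty="l1", dual=False, C=1, random_state=1, max_iter=10000)

# Raw features versus features standardized within each training split.
scores_raw = cross_val_score(svc, X, y, cv=cv)
scores_scaled = cross_val_score(make_pipeline(StandardScaler(), svc), X, y, cv=cv)

print("raw:    %.2f +/- %.2f" % (scores_raw.mean(), scores_raw.std()))
print("scaled: %.2f +/- %.2f" % (scores_scaled.mean(), scores_scaled.std()))
```

The printed accuracies depend entirely on the synthetic data and say nothing about the 48% and 100% figures reported in the thread.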
>> 
>> 
>> 
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud 
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 

