Hi, Fabrizio,

sure, it absolutely makes sense to standardize your data if you are using models 
such as linear SVMs, logistic regression, etc. -- in fact, decision trees / 
random forests are the only models I can think of where standardization may be 
redundant.

Standardization will center your data and bring all features onto a similar 
scale. Imagine you learn the weights via gradient descent and you have two 
features: the first in the range 1-10, the second in the range 1-10000. The 
learning algorithm will then mostly be busy updating the weight of feature 2, 
because the gradient of the cost (e.g., picture a simple sum-of-squared-errors 
cost in linear regression) will be dominated by that feature's much larger 
values. Also, mean centering can matter for good convergence, since the weights 
are typically initialized to 0 or to small random values in most implementations.
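
To make that concrete, here is a quick sketch on synthetic data (purely 
illustrative, not your dataset) that compares the per-feature gradients of a 
sum-of-squared-errors cost at w = 0 before and after scaling with 
StandardScaler -- the raw gradient is dominated by the large-scale feature, 
while after standardization both features contribute on a comparable scale:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# two features on very different scales
x1 = rng.uniform(1, 10, size=100)       # range ~1-10
x2 = rng.uniform(1, 10000, size=100)    # range ~1-10000
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.002 * x2 + rng.randn(100)

# gradient of the sum-of-squared-errors cost at w = 0: X^T (X w - y)
w = np.zeros(2)
grad = X.T.dot(X.dot(w) - y)
print("raw gradients:    %s" % grad)      # feature 2 dominates by orders of magnitude

X_std = StandardScaler().fit_transform(X)
grad_std = X_std.T.dot(X_std.dot(w) - y)
print("scaled gradients: %s" % grad_std)  # comparable magnitudes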

Best,
Sebastian 

> On Apr 29, 2015, at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
> 
> Dear experts,
> 
> I’m experiencing a dramatic improvement in cross-validation accuracy when the 
> data are standardised.
> 
> I mean, accuracy increased from 48% to 100% when I switch from X to 
> X_scaled = preprocessing.scale(X).
> 
> Does it make sense in your opinion?
> 
> Thank you a lot for any suggestions,
> 
> Fabrizio
> 
> 
> 
> my CODE:
> 
> import numpy as np
> from sklearn import preprocessing
> from sklearn.svm import LinearSVC
> from sklearn.cross_validation import StratifiedShuffleSplit
> 
> # 14 features, 16 samples dataset
> data = np.loadtxt("data.txt")
> y=data[:,0]
> X=data[:,1:15]
> X_scaled = preprocessing.scale(X)
> 
> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> cv_scores=[]
> 
> for train_index, test_index in sss:
>   X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>   y_train, y_test = y[train_index], y[test_index]
>   clf.fit(X_train, y_train)
>   y_pred = clf.predict(X_test)
>   cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
> 
> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", 
> np.ceil(200*np.std(cv_scores))
> 
> 
> 

