Hi Fabrizio,

Sure, it makes absolute sense to standardize your data if you are using models such as linear SVMs, logistic regression, etc. In fact, decision trees / random forests are the only common models I can think of where standardization may be redundant, since a split on a single feature is insensitive to that feature's scale.
Standardization will center your data and bring the features onto a similar scale. Imagine you are learning the weights via gradient descent and you have two features, the first in the range 1-10 and the second in the range 1-10,000. The learning algorithm will then mostly be busy updating the weight of feature 2, because the cost (picture a simple sum-of-squared-errors cost function in linear regression) will tend to be dominated by feature 2. Also, the mean centering may be important for well-behaved optimization, since the weights are typically initialized to 0 or to small random values in most implementations.

Best,
Sebastian

> On Apr 29, 2015, at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
>
> Dear experts,
>
> I'm experiencing a dramatic improvement in cross-validation when the data
> are standardised.
>
> I mean, accuracy increased from 48% to 100% when I shift from X to
> X_scaled = preprocessing.scale(X).
>
> Does it make sense in your opinion?
>
> Thank you a lot for any suggestion,
>
> Fabrizio
>
> my CODE:
>
> import numpy as np
> from sklearn import preprocessing
> from sklearn.svm import LinearSVC
> from sklearn.cross_validation import StratifiedShuffleSplit
>
> # 14 features, 16 samples dataset
> data = np.loadtxt("data.txt")
> y = data[:, 0]
> X = data[:, 1:15]
> X_scaled = preprocessing.scale(X)
>
> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> cv_scores = []
>
> for train_index, test_index in sss:
>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>     y_train, y_test = y[train_index], y[test_index]
>     clf.fit(X_train, y_train)
>     y_pred = clf.predict(X_test)
>     cv_scores.append(np.mean(y_pred == y_test))
>
> print "Accuracy ", np.ceil(100 * np.mean(cv_scores)), "+/-", np.ceil(200 * np.std(cv_scores))
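If you want to see the effect in isolation, here is a minimal sketch. It is not Fabrizio's data: it uses a synthetic make_classification dataset with one column artificially blown up by a factor of 1000, and it fits a linear SVM via SGDClassifier so that the gradient descent argument above applies directly. It also assumes a recent scikit-learn where cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation. The exact numbers will vary with the random data.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the situation above: 14 features on comparable
# scales, except the first column, which is blown up by a factor of 1000.
X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X[:, 0] *= 1000.0

clf = SGDClassifier(loss="hinge", random_state=1)  # a linear SVM fit by SGD

# Without scaling vs. with a StandardScaler inside a Pipeline (the scaler
# is then re-fit on each training fold, so the test folds never leak into
# the scaling statistics).
raw = cross_val_score(clf, X, y, cv=5).mean()
std = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=5).mean()
print("accuracy without standardization: %.3f" % raw)
print("accuracy with standardization:    %.3f" % std)

On most runs the unscaled score is noticeably lower: the one huge-scale feature dominates the gradient updates, exactly as described above.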