Thank you very much, Sebastian. I was very surprised by such a significant improvement, but if it is "not uncommon" for the reasons you explained, that's great.
Best,
Fabrizio

> On 29 Apr 2015, at 17:40, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> Hi, Fabrizio,
>
> sure, it makes absolute sense to standardize your data if you are using
> models such as linear SVM, logistic regression, etc. -- in fact, I can only
> think of decision trees / random forests where standardization may be
> redundant.
>
> Standardization will center your data and bring the features onto a similar
> scale. Imagine you learn the weights via gradient descent and you have two
> features, the first in the range 1-10 and the second in the range 1-10000.
> Then the learning algorithm will mostly be busy updating the weights with
> respect to feature 2, because the cost (e.g., picture a simple
> sum-of-squared-errors cost function in linear regression) will tend to be
> much larger for feature 2. Also, the mean centering may be important for
> optimal behavior, since the weights are typically initialized to 0 or small
> random values in most implementations.
>
> Best,
> Sebastian
>
>> On Apr 29, 2015, at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
>>
>> Dear experts,
>>
>> I'm experiencing a dramatic improvement in cross-validation when the data
>> are standardised.
>>
>> I mean accuracy increased from 48% to 100% when I switched from X to
>> X_scaled = preprocessing.scale(X)
>>
>> Does it make sense in your opinion?
>>
>> Thank you a lot for any suggestion,
>>
>> Fabrizio
>>
>> my CODE:
>>
>> import numpy as np
>> from sklearn import preprocessing
>> from sklearn.svm import LinearSVC
>> from sklearn.model_selection import StratifiedShuffleSplit
>>
>> # 14 features, 16 samples dataset; column 0 holds the class labels
>> data = np.loadtxt("data.txt")
>> y = data[:, 0]
>> X = data[:, 1:15]
>> X_scaled = preprocessing.scale(X)
>>
>> # 10000 stratified random splits, 25% of the samples held out each time
>> sss = StratifiedShuffleSplit(n_splits=10000, test_size=0.25, random_state=0)
>> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
>> cv_scores = []
>>
>> for train_index, test_index in sss.split(X_scaled, y):
>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>     clf.fit(X_train, y_train)
>>     y_pred = clf.predict(X_test)
>>     # fraction of correctly classified test samples in this split
>>     cv_scores.append(np.mean(y_pred == y_test))
>>
>> # mean accuracy in percent, +/- two standard deviations
>> print("Accuracy", np.ceil(100 * np.mean(cv_scores)), "+/-",
>>       np.ceil(200 * np.std(cv_scores)))
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
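
Sebastian's gradient-descent argument in the quoted reply can be illustrated numerically. The following is only a rough sketch, not the code used in the thread: the data are synthetic, the target `y` is a hypothetical one chosen so that both features matter equally once rescaled, and the helper names `sse_gradient` and `standardize` are made up for this illustration. It compares the gradient of a sum-of-squared-errors cost at zero-initialized weights, before and after standardization.

```python
import numpy as np


def sse_gradient(X, y, w):
    """Gradient of the cost 0.5 * ||X @ w - y||^2 with respect to w."""
    return X.T @ (X @ w - y)


def standardize(X):
    """Center each column to zero mean and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(1, 10, n),      # feature 1: range 1-10
    rng.uniform(1, 10000, n),   # feature 2: range 1-10000
])
# Hypothetical target: each feature contributes about equally after rescaling.
y = X[:, 0] / 10 + X[:, 1] / 10000 + rng.normal(0, 0.01, n)

w0 = np.zeros(2)  # weights initialized to 0, as in the quoted explanation

grad_raw = sse_gradient(X, y, w0)
grad_std = sse_gradient(standardize(X), y - y.mean(), w0)

# How much larger is the gradient component for feature 2 than for feature 1?
ratio_raw = abs(grad_raw[1]) / abs(grad_raw[0])
ratio_std = abs(grad_std[1]) / abs(grad_std[0])

print("gradient ratio (feature 2 / feature 1), raw data:    ", ratio_raw)
print("gradient ratio (feature 2 / feature 1), standardized:", ratio_std)
```

On the raw data the gradient component for feature 2 should come out orders of magnitude larger than the one for feature 1, so a single learning rate cannot suit both weights. After standardization the two components are of comparable size.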