Oops,
I created the pipeline as you suggested, and the accuracy came out at 94%
(the 75% figure in my last email was wrong; I hadn't scaled the test set).
Do you think my implementation of your suggestion is right?
thank you so much,
Fabrizio
CODE:
print "\nWhen a stratified shuffle split is applied"
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
cv_scores = []
scaler = StandardScaler()
for train_index, test_index in sss:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit the scaler on the training split only, then apply it to the test split
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # train and evaluate
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
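(For anyone reading along: the loop above can be replaced by the Pipeline that Michael suggests below, which fits the scaler inside each split automatically. This is a minimal sketch using the current scikit-learn API, where `cross_validation` has since been renamed `model_selection` and `StratifiedShuffleSplit` takes `n_splits` instead of the labels; the random data here is just a stand-in for the thread's 16-sample, 14-feature dataset, so the printed accuracy is meaningless.)

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Stand-in data with the thread's shape: 16 samples, 14 features, 2 classes.
rng = np.random.RandomState(0)
X = rng.randn(16, 14)
y = np.array([0, 1] * 8)

# The pipeline refits StandardScaler on each training split only; its mean
# and std are then applied to the matching test split, so no test-set
# information leaks into the preprocessing.
pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", dual=False, C=1, random_state=1),
)
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
scores = cross_val_score(pipe, X, y, cv=sss)
print("Accuracy %.0f%% +/- %.0f%%" % (100 * scores.mean(), 200 * scores.std()))
```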
> On 30 Apr 2015, at 10:39, Michael Eickenberg <michael.eickenb...@gmail.com>
> wrote:
>
> Hi Fabrizio,
>
> standardizing over train and test together is a classic way of leaking train
> data information into the test set. Standardizing train and test splits
> separately is OK or not depending on the situation: If you are interested in
> predicting correctly each individual test sample, then this is not
> appropriate, since you are linking the test data points. If you are
> interested in a global measure, then it may be OK to standardize over test
> samples.
>
> A way to avoid errors in this regard is to use a scikit-learn Pipeline with a
> StandardScaler prepended to your estimator. This object will estimate mean
> and sdev on the training set and standardize the test set using those
> estimated values. If this method worsens your results, there may be an
> unaccounted-for trend in your data.
>
> Michael
>
>
> On Thu, Apr 30, 2015 at 10:32 AM, Fabrizio Fasano
> <fabrizio.fas...@nemo.unipr.it <mailto:fabrizio.fas...@nemo.unipr.it>> wrote:
> Hi Kyle,
>
> I standardised the train and test sets separately for each permutation, and
> it reduced the accuracy from 100% to 75%.
> I would argue that standardising the train and test sets together leaked
> information into the test set, producing an erroneously high accuracy estimate.
>
> does it sound plausible?
>
> Thank You again,
>
> Best,
> Fabrizio
>
>
> > On 29 Apr 2015, at 17:36, Kyle Kastner <kastnerk...@gmail.com
> > <mailto:kastnerk...@gmail.com>> wrote:
> >
> > Data preprocessing is important. One thing you might want to do is compute
> > your preprocessing scaling values over the training data only - technically,
> > computing them over the whole dataset is not valid, as that includes
> > the test data.
> >
> > It is hard to say whether 100% is believable or not, but you should
> > probably only take scaling parameters from the training data.
> >
> > On Wed, Apr 29, 2015 at 11:13 AM, Fabrizio Fasano <han...@gmail.com
> > <mailto:han...@gmail.com>> wrote:
> >> Dear experts,
> >>
> >> I’m experiencing a dramatic improvement in cross-validation when the data
> >> are standardised.
> >>
> >> I mean accuracy increased from 48% to 100% when I switch from X to X_scaled
> >> = preprocessing.scale(X)
> >>
> >> Does it make sense in your opinion?
> >>
> >> Thank You a lot for any suggestion,
> >>
> >> Fabrizio
> >>
> >>
> >>
> >> my CODE:
> >>
> >> import numpy as np
> >> from sklearn import preprocessing, svm
> >> from sklearn.cross_validation import StratifiedShuffleSplit
> >>
> >> # 14 features, 16 samples dataset
> >> data = np.loadtxt("data.txt")
> >> y = data[:, 0]
> >> X = data[:, 1:15]
> >> X_scaled = preprocessing.scale(X)
> >>
> >> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> >> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> >> cv_scores = []
> >>
> >> for train_index, test_index in sss:
> >>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
> >>     y_train, y_test = y[train_index], y[test_index]
> >>     clf.fit(X_train, y_train)
> >>     y_pred = clf.predict(X_test)
> >>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
> >>
> >> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
> >>
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> One dashboard for servers and applications across Physical-Virtual-Cloud
> >> Widest out-of-the-box monitoring support with 50+ applications
> >> Performance metrics, stats and reports that give you Actionable Insights
> >> Deep dive visibility with transaction tracing using APM Insight.
> >> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> >> <http://ad.doubleclick.net/ddm/clk/290420510;117567292;y>
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> <mailto:Scikit-learn-general@lists.sourceforge.net>
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >> <https://lists.sourceforge.net/lists/listinfo/scikit-learn-general>
> >
>
>
>