Hi Fabrizio,
standardizing over train and test together is a classic way of leaking
test-set information into the training step. Whether standardizing the
train and test splits separately is OK depends on the situation: if you
need to predict each individual test sample correctly, it is not
appropriate, because each standardized test point then depends on the
other test points. If you are only interested in a global measure, it may
be OK to standardize over the test samples.
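To make the "linking" point concrete, here is a small sketch on made-up numbers (the arrays and variable names are purely illustrative, not your data). It standardizes the same test point as part of two different test batches, once with statistics taken from each batch and once with statistics taken from the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data and one test point of interest (illustrative only).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
x_new = np.array([2.5, 25.0])

# The same test point placed in two different test batches.
batch_a = np.array([x_new, [1.0, 10.0], [4.0, 40.0]])
batch_b = np.array([x_new, [5.0, 50.0], [6.0, 60.0]])

# Standardizing each test batch on its own: the result for x_new
# depends on which other points happen to be in the batch.
sep_a = StandardScaler().fit_transform(batch_a)[0]
sep_b = StandardScaler().fit_transform(batch_b)[0]

# Standardizing with training-set statistics: the result for x_new
# is the same whatever else is in the batch.
scaler = StandardScaler().fit(X_train)
tr_a = scaler.transform(batch_a)[0]
tr_b = scaler.transform(batch_b)[0]

same_when_separate = np.allclose(sep_a, sep_b)
same_with_train_stats = np.allclose(tr_a, tr_b)
print(same_when_separate, same_with_train_stats)
```

With batch-wise statistics the representation of a single test sample changes with its neighbours, which is why per-sample predictions are no longer independent.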
One way to avoid mistakes here is to use a scikit-learn Pipeline with a
StandardScaler placed before your estimator. The pipeline estimates the
mean and standard deviation on the training set only and standardizes the
test set using those estimates. If this method worsens your results, there
may be an unaccounted-for trend in your data.
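A minimal sketch of that setup follows. It makes two assumptions: the data are random placeholders with the same shape as yours (16 samples, 14 features), since I don't have data.txt, and it uses the `sklearn.model_selection` module of recent scikit-learn releases rather than the `sklearn.cross_validation` module from your script. The step names "scale" and "svc" are arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder data with the same shape as in the question (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(16, 14))
y = np.array([0, 1] * 8)

# The scaler is refit on the training part of every split, and the
# test part is transformed with those training statistics only.
model = Pipeline([
    ("scale", StandardScaler()),
    ("svc", LinearSVC(penalty="l1", dual=False, C=1, random_state=1)),
])

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

Because the whole pipeline is passed to `cross_val_score`, there is no point at which the scaler can see the test samples of a given split.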
Michael
On Thu, Apr 30, 2015 at 10:32 AM, Fabrizio Fasano <fabrizio.fas...@nemo.unipr.it> wrote:
> Hi Kyle,
>
> I standardised the train and test sets separately for each permutation,
> and the accuracy dropped from 100% to 75%.
> I would argue that standardising the train and test sets together
> introduced some bias, resulting in an erroneously high accuracy estimate.
>
> Does that sound plausible?
>
> Thank You again,
>
> Best,
> Fabrizio
>
>
> > On 29 Apr 2015, at 17:36, Kyle Kastner <kastnerk...@gmail.com> wrote:
> >
> > Data preprocessing is important. One thing you might want to do is get
> > your preprocessing scaling values over the training data - technically
> > getting the value over the whole dataset is not valid as that includes
> > the test data.
> >
> > It is hard to say whether 100% is believable or not, but you should
> > probably only take scaling over training data.
> >
> > On Wed, Apr 29, 2015 at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
> >> Dear experts,
> >>
> >> I’m experiencing a dramatic improvement in cross-validation when the
> >> data are standardised.
> >>
> >> I mean accuracy increased from 48% to 100% when I switch from X to
> >> X_scaled = preprocessing.scale(X)
> >>
> >> Does it make sense in your opinion?
> >>
> >> Thank you very much for any suggestions,
> >>
> >> Fabrizio
> >>
> >>
> >>
> >> my CODE:
> >>
> >> import numpy as np
> >> from sklearn import preprocessing
> >> from sklearn.svm import LinearSVC
> >> from sklearn.cross_validation import StratifiedShuffleSplit
> >>
> >> # 14 features, 16 samples dataset
> >> data = np.loadtxt("data.txt")
> >> y = data[:, 0]      # first column holds the class labels
> >> X = data[:, 1:15]   # remaining 14 columns are the features
> >> # scaling here uses all samples, including the future test samples
> >> X_scaled = preprocessing.scale(X)
> >>
> >> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> >> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> >> cv_scores = []
> >>
> >> for train_index, test_index in sss:
> >>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
> >>     y_train, y_test = y[train_index], y[test_index]
> >>     clf.fit(X_train, y_train)
> >>     y_pred = clf.predict(X_test)
> >>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
> >>
> >> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
> >>
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> One dashboard for servers and applications across Physical-Virtual-Cloud
> >> Widest out-of-the-box monitoring support with 50+ applications
> >> Performance metrics, stats and reports that give you Actionable Insights
> >> Deep dive visibility with transaction tracing using APM Insight.
> >> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general