Oops,
I created the pipeline as you suggested, and the accuracy came out at 94%
(the 75% figure in my last email was wrong; I hadn't scaled the test set).
Do you think my implementation of your suggestion is right?
thank you so much,
Fabrizio
CODE:
print "\nWhen a stratified shuffle split is applied"
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
cv_scores = []
scaler = StandardScaler()
for train_index, test_index in sss:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit the scaler on the training split only, then apply it to the test split
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # train and evaluate
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
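(For anyone reading along: the loop above can be replaced by the Pipeline that Michael suggests below, which fits the scaler inside each split automatically. This is a minimal sketch using the current scikit-learn API, where `cross_validation` has since been renamed `model_selection` and `StratifiedShuffleSplit` takes `n_splits` instead of the labels; the random data here is just a stand-in for the thread's 16-sample, 14-feature dataset, so the printed accuracy is meaningless.)

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Stand-in data with the thread's shape: 16 samples, 14 features, 2 classes.
rng = np.random.RandomState(0)
X = rng.randn(16, 14)
y = np.array([0, 1] * 8)

# The pipeline refits StandardScaler on each training split only; its mean
# and std are then applied to the matching test split, so no test-set
# information leaks into the preprocessing.
pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", dual=False, C=1, random_state=1),
)
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
scores = cross_val_score(pipe, X, y, cv=sss)
print("Accuracy %.0f%% +/- %.0f%%" % (100 * scores.mean(), 200 * scores.std()))
```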
> On 30 Apr 2015, at 10:39, Michael Eickenberg <michael.eickenb...@gmail.com>
> wrote:
>
> Hi Fabrizio,
>
> standardizing over train and test together is a classic way of leaking train
> data information into the test set. Standardizing train and test splits
> separately is OK or not depending on the situation: If you are interested in
> predicting correctly each individual test sample, then this is not
> appropriate, since you are linking the test data points. If you are
> interested in a global measure, then it may be OK to standardize over test
> samples.
>
> A way to avoid errors in this regard is to use a scikit-learn Pipeline with a
> StandardScaler prepended to your estimator. This object will estimate mean
> and sdev on the training set and standardize the test set using those
> estimated values. If this method worsens your results, there may be an
> unaccounted-for trend in your data.
>
> Michael
>
>
> On Thu, Apr 30, 2015 at 10:32 AM, Fabrizio Fasano
> <fabrizio.fas...@nemo.unipr.it <mailto:fabrizio.fas...@nemo.unipr.it>> wrote:
> Hi Kyle,
>
> I standardised the train and test sets separately for each permutation, and
> it reduced the accuracy from 100% to 75%.
> I would argue that standardising the train and test sets together leaked
> information into the test set, producing an erroneously high accuracy estimate.
>
> does it sound plausible?
>
> Thank You again,
>
> Best,
> Fabrizio
>
>
> > On 29 Apr 2015, at 17:36, Kyle Kastner <kastnerk...@gmail.com
> > <mailto:kastnerk...@gmail.com>> wrote:
> >
> > Data preprocessing is important. One thing you might want to do is compute
> > your preprocessing scaling values over the training data only - technically,
> > computing them over the whole dataset is not valid, as that includes
> > the test data.
> >
> > It is hard to say whether 100% is believable or not, but you should
> > probably only take scaling parameters from the training data.
> >
> > On Wed, Apr 29, 2015 at 11:13 AM, Fabrizio Fasano <han...@gmail.com
> > <mailto:han...@gmail.com>> wrote:
> >> Dear experts,
> >>
> >> I’m experiencing a dramatic improvement in cross-validation when the data
> >> are standardised.
> >>
> >> I mean accuracy increased from 48% to 100% when I switch from X to X_scaled
> >> = preprocessing.scale(X)
> >>
> >> Does it make sense in your opinion?
> >>
> >> Thank You a lot for any suggestion,
> >>
> >> Fabrizio
> >>
> >>
> >>
> >> my CODE:
> >>
> >> import numpy as np
> >> from sklearn import preprocessing, svm
> >> from sklearn.cross_validation import StratifiedShuffleSplit
> >>
> >> # 14 features, 16 samples dataset
> >> data = np.loadtxt("data.txt")
> >> y = data[:, 0]
> >> X = data[:, 1:15]
> >> X_scaled = preprocessing.scale(X)
> >>
> >> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> >> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> >> cv_scores = []
> >>
> >> for train_index, test_index in sss:
> >>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
> >>     y_train, y_test = y[train_index], y[test_index]
> >>     clf.fit(X_train, y_train)
> >>     y_pred = clf.predict(X_test)
> >>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
> >>
> >> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
> >>
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> One dashboard for servers and applications across Physical-Virtual-Cloud
> >> Widest out-of-the-box monitoring support with 50+ applications
> >> Performance metrics, stats and reports that give you Actionable Insights
> >> Deep dive visibility with transaction tracing using APM Insight.
> >> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> >> <http://ad.doubleclick.net/ddm/clk/290420510;117567292;y>
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> <mailto:Scikit-learn-general@lists.sourceforge.net>
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >> <https://lists.sourceforge.net/lists/listinfo/scikit-learn-general>
> >
>
>
>