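Data preprocessing is important, so a jump in accuracy after scaling is
not surprising in itself. One thing you should do, though, is compute
your scaling values (mean and standard deviation) over the training data
only. Strictly speaking, computing them over the whole dataset, as
preprocessing.scale(X) does here, is not valid, because that includes
the test data.

It is hard to say whether 100% is believable or not, but you should fit
the scaling on the training split of each iteration and then apply the
same transformation to the test split.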
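
To make that concrete, here is a minimal sketch of one way to keep the scaling inside the cross-validation loop. It is not your exact setup: I don't have data.txt, so X and y below are a made-up random stand-in with the same shape (16 samples, 14 features), and I use the current sklearn.model_selection API rather than the sklearn.cross_validation module in your code. The number of splits is also reduced from 10000 to 100 just to keep it quick.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in for data.txt (assumption): 16 samples, 14 features,
# two balanced classes, labels unrelated to the features.
rng = np.random.RandomState(0)
X = rng.normal(size=(16, 14))
y = np.array([0, 1] * 8)

# The pipeline refits StandardScaler on the training part of every split,
# so the test samples never contribute to the mean and standard deviation.
model = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", dual=False, C=1, random_state=1),
)

# 100 splits instead of 10000, only to keep the run short.
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=sss)

print("Accuracy %.2f +/- %.2f" % (cv_scores.mean(), cv_scores.std()))
```

If you swap in your real X and y, the score from this version is the one to compare against the 48% you get without scaling.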

On Wed, Apr 29, 2015 at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
> Dear experts,
>
> I’m experiencing a dramatic improvement in cross-validation when data are standardised
>
> I mean accuracy increased from 48% to 100% when I shift from X to X_scaled = preprocessing.scale(X)
>
> Does it make sense in your opinion?
>
> Thank You a lot for any suggestion,
>
> Fabrizio
>
>
>
> my CODE:
>
> import numpy as np
> from sklearn import preprocessing
> from sklearn.svm import LinearSVC
> from sklearn.cross_validation import StratifiedShuffleSplit
>
> # 14 features, 16 samples dataset
> data = np.loadtxt("data.txt")
> y=data[:,0]
> X=data[:,1:15]
> X_scaled = preprocessing.scale(X)
>
> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> cv_scores=[]
>
> for train_index, test_index in sss:
>    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>    y_train, y_test = y[train_index], y[test_index]
>    clf.fit(X_train, y_train)
>    y_pred = clf.predict(X_test)
>    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
>
> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
>
>
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
