Re: [Scikit-learn-general] Dramatic improvement by standardizing data?

Fabrizio Fasano Thu, 30 Apr 2015 01:06:18 -0700

Hi Kyle,

Thank you for the suggestion.


If I standardise only the training set, will the classifier still work well 
on the non-standardised test set? Or do I have to apply some transformation 
to the test set before running the classifier on it?

Fabrizio 
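
One common way to handle the test set, sketched here under assumptions: fit a
StandardScaler on the training samples only, then apply that same fitted
transform to the test samples. The synthetic X and y below are stand-ins for
data.txt, and the fixed 12/4 split is only for illustration. The imports use
the current scikit-learn layout, which may differ from the 2015 release used
in this thread.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for data.txt (an assumption): 16 samples, 14 features
rng = np.random.RandomState(0)
X = rng.normal(size=(16, 14)) * 50.0 + 10.0
y = np.array([0, 1] * 8)

# Illustrative fixed split: first 12 samples for training, last 4 for testing
X_train, X_test = X[:12], X[12:]
y_train, y_test = y[:12], y[12:]

# Learn the per-feature mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the same training-derived shift and scale to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print("Test accuracy:", accuracy)
```

The test set is transformed with the training statistics, so it will not have
exactly zero mean and unit variance itself. That is expected.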


> On 29 Apr 2015, at 17:36, Kyle Kastner <kastnerk...@gmail.com> wrote:
> 
> Data preprocessing is important. One thing you might want to do is compute
> your preprocessing scaling values over the training data only. Technically,
> computing them over the whole dataset is not valid, as that includes
> the test data.
> 
> It is hard to say whether 100% is believable or not, but you should
> probably compute the scaling over the training data only.
> 
> On Wed, Apr 29, 2015 at 11:13 AM, Fabrizio Fasano <han...@gmail.com> wrote:
>> Dear experts,
>> 
>> I’m seeing a dramatic improvement in cross-validation accuracy when the 
>> data are standardised.
>> 
>> Specifically, accuracy increased from 48% to 100% when I switched from X to 
>> X_scaled = preprocessing.scale(X).
>> 
>> Does this make sense in your opinion?
>> 
>> Thank you very much for any suggestions,
>> 
>> Fabrizio
>> 
>> 
>> 
>> my CODE:
>> 
>> import numpy as np
>> from sklearn import preprocessing
>> from sklearn.svm import LinearSVC
>> from sklearn.cross_validation import StratifiedShuffleSplit
>> 
>> # 16 samples; column 0 is the class label, columns 1-14 are the 14 features
>> data = np.loadtxt("data.txt")
>> y = data[:, 0]
>> X = data[:, 1:15]
>> # Scaling here uses all samples, including those later held out for testing
>> X_scaled = preprocessing.scale(X)
>> 
>> sss = StratifiedShuffleSplit(y, 10000, test_size=0.25, random_state=0)
>> clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
>> cv_scores = []
>> 
>> for train_index, test_index in sss:
>>   X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>   y_train, y_test = y[train_index], y[test_index]
>>   clf.fit(X_train, y_train)
>>   y_pred = clf.predict(X_test)
>>   # Fraction of correctly classified test samples for this split
>>   cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
>> 
>> # Mean accuracy and two standard deviations, as percentages rounded up
>> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
>> 
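
A possible variant of the loop above that keeps the scaling inside each
cross-validation split, sketched under assumptions: the scaler and classifier
are chained in a pipeline, so the scaler is refit on the training portion of
every split. The synthetic X and y stand in for data.txt, n_splits is reduced
to 100 to keep the run short, and the imports follow the current scikit-learn
layout (sklearn.model_selection) rather than the sklearn.cross_validation
module used above.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for data.txt (an assumption): 16 samples, 14 features
rng = np.random.RandomState(0)
X = rng.normal(size=(16, 14)) * 50.0 + 10.0
y = np.array([0, 1] * 8)

# The pipeline refits StandardScaler on the training part of every split,
# so no statistics from the held-out samples reach the classifier
model = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", dual=False, C=1, random_state=1),
)

# 100 splits instead of 10000, purely to keep this sketch fast
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=sss)

print("Accuracy %.1f%% +/- %.1f%%" % (100 * cv_scores.mean(), 200 * cv_scores.std()))
```

With labels unrelated to the features, as in this synthetic data, the mean
accuracy would be expected to sit near chance level rather than near 100%.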
>> 
>> 
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 


