For 1), the two methods should give the same result, except that there is currently no stratification in train_test_split, so StratifiedShuffleSplit should be better.
For 2), 51.66% over 100 permutations seems more reasonable than 60%.
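As a quick illustration of the stratification point: with only 16 samples, a plain train_test_split can produce an unbalanced test fold, whereas StratifiedShuffleSplit keeps the 8/8 class ratio in every fold. A minimal sketch on toy labels (not your data):

import numpy as np
from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit

# toy labels with the same 8/8 class balance as in the thread
y = np.array([0] * 8 + [1] * 8)
X = np.arange(16).reshape(-1, 1)

# plain train_test_split: the 4-sample test fold may or may not be balanced,
# depending on the random seed
_, _, _, y_test = train_test_split(X, y, test_size=0.25, random_state=3)
print("train_test_split test labels:", y_test)

# StratifiedShuffleSplit: every test fold keeps the 50/50 class ratio
sss = StratifiedShuffleSplit(y, 3, test_size=0.25, random_state=0)
for _, test_index in sss:
    print("stratified test labels:      ", y[test_index])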


On 04/28/2015 05:04 AM, Fabrizio Fasano wrote:
Thanks a lot:

Based on your suggestion I performed the following 2 tests (code below);

1) On the true labels, instead of defining the train/test split with StratifiedShuffleSplit, I generated 10000 random train/test splits with cross_validation.train_test_split, and the result was Accuracy: 98.00 (+/- 16.49)


2) On false labels, obtained by permuting the true labels 100 times, I generated 100 train/test splits for each label permutation, and the result was Accuracy: 51.66 (+/- 50.32)

Do my tests make sense, in your opinion?

If so I would be very happy, because I could be confident that my good (98%) result is a true one and not a biased one…

Thank you again

Fabrizio


CODE:

# 1) true labels
import numpy as np
from sklearn import svm, cross_validation

niter = 10000
scores = np.zeros(niter)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
for rs in range(0, niter):
    # a new random train/test split on every iteration
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X_scaled, y, test_size=0.25, random_state=rs + 1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores[rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


# 2) false labels obtained by permutation of the true ones
niter = 100
nperm = 100
scores = np.zeros([nperm, niter])
# permute the labels to check that accuracy drops to chance level
for i in range(0, nperm):
    yfalse = np.random.permutation(y)
    print("\nWhen a manual permutation procedure is applied")
    clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
    for rs in range(0, niter):
        # a new random train/test split on the permuted labels
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X_scaled, yfalse, test_size=0.25, random_state=rs + 1)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        scores[i, rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))



On 27 Apr 2015, at 22:27, Andreas Mueller <t3k...@gmail.com> wrote:

You changed the labels only once, and have a test-set size of 4? I would
imagine that is where that comes from.
If you repeat over different assignments, you will get 50/50.

On 04/27/2015 11:33 AM, Fabrizio Fasano wrote:
Dear Andy,

Yes, the classes have the same size, 8 and 8

This is one example of the code I used to cross-validate the classification (here I used StratifiedShuffleSplit, but I also tried other methods such as leave-one-out and simple 4-fold cross-validation, and the results didn't change much)

import numpy as np
from sklearn import svm
from sklearn.cross_validation import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)

cv_scores=[]
for train_index, test_index in sss:
  X_train, X_test = X_scaled[train_index], X_scaled[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))

print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))




On Apr 26, 2015, at 7:50 PM, Andy wrote:

Your expectation is right, if you randomly assign labels, you shouldn't
get more than 50% correct with a large enough dataset.
I imagine there is some issue in how you shuffled the labels. Without
the code, it is hard to tell.
Are you sure the classes have the same size?

On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:
Dear Andreas,

Thanks a lot for your help,

About the random assignment of values to my labels y: what I mean is that, being suspicious of the too-good performance, I changed the labels manually, keeping the 50% split of 1s and 0s but in different orders, and the labels were always predicted very well, with accuracy never lower than 60%. By chance I expected values below 50% as well as above 50%. I didn't perform an exhaustive test (I only did it manually for a few combinations)...

Fabrizio