Thanks a lot!
Based on your suggestion, I performed the following two tests (code below):
1) On the true labels, instead of defining the train/test split with
StratifiedShuffleSplit, I drew 10000 random train/test splits with
cross_validation.train_test_split; the accuracy was 98.00 (+/- 16.49).
2) On false labels obtained by permuting the true labels 100 times, I drew 100
random train/test splits for each label permutation; the accuracy was
51.66 (+/- 50.32).
In your opinion, do these tests make sense?
If so, I would be very happy, because I could be confident that my good (98%)
result is genuine and not a biased one…
Thank you again,
Fabrizio
CODE:
# 1) true labels
import numpy as np
from sklearn import svm, cross_validation

niter = 10000
scores = np.zeros(niter)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
for rs in range(niter):
    # draw a fresh random 75/25 train/test split on every iteration
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X_scaled, y, test_size=0.25, random_state=rs + 1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # float arithmetic so the percentage is not truncated under Python 2
    scores[rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# 2) false labels obtained by permuting the true ones
nperm = 100   # number of label permutations
niter = 100   # number of train/test splits per permutation
scores = np.zeros((nperm, niter))
# permuting the labels should destroy any real signal, so accuracy
# should drop to chance level (about 50% for two balanced classes)
print("\nWhen a manual label permutation procedure is applied:")
for i in range(nperm):
    yfalse = np.random.permutation(y)
    clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
    for rs in range(niter):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X_scaled, yfalse, test_size=0.25, random_state=rs + 1)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        scores[i, rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
> On 27 Apr 2015, at 22:27, Andreas Mueller <t3k...@gmail.com> wrote:
>
> You changed the labels only once, and have a test-set size of 4? I would
> imagine that is where that comes from.
> If you repeat over different assignments, you will get 50/50.
>
> On 04/27/2015 11:33 AM, Fabrizio Fasano wrote:
>> Dear Andy,
>>
>> Yes, the classes have the same size, 8 and 8
>>
>> this is one example of the code I used to cross-validate the classification
>> (here I used StratifiedShuffleSplit, but I also tried other methods such as
>> leave-one-out and simple 4-fold cross-validation, and the result didn't change much)
>>
>> from sklearn.cross_validation import StratifiedShuffleSplit
>> sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
>> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
>>
>> cv_scores=[]
>> for train_index, test_index in sss:
>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>     clf.fit(X_train, y_train)
>>     y_pred = clf.predict(X_test)
>>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
>>
>> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
>>
>>
>>
>>
>> On Apr 26, 2015, at 7:50 PM, Andy wrote:
>>
>>> Your expectation is right, if you randomly assign labels, you shouldn't
>>> get more than 50% correct with a large enough dataset.
>>> I imagine there is some issue in how you shuffled the labels. Without
>>> the code, it is hard to tell.
>>> Are you sure the classes have the same size?
>>>
>>> On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:
>>>> Dear Andreas,
>>>>
>>>> Thanks a lot for your help,
>>>>
>>>> About the random assignment of values to my labels y: being suspicious of
>>>> the too-good performance, I changed the labels manually, keeping the 50/50
>>>> split of 1s and 0s but in different orders, and the labels were still
>>>> predicted very well, with accuracy never lower than 60%. By chance I
>>>> expected values below 50% as well as above 50%. I didn't perform an
>>>> exhaustive test (I only did it manually for a few combinations)...
>>>>
>>>> Fabrizio