Thanks a lot!
Based on your suggestion, I performed the following two tests (code below):
1) On the true labels, instead of defining the train/test split with
StratifiedShuffleSplit, I drew 10000 random train/test splits with
cross_validation.train_test_split; the accuracy was 98.00 (+/- 16.49).
2) On false labels obtained by permuting the true labels 100 times, I drew 100
random train/test splits for each label permutation; the accuracy was
51.66 (+/- 50.32).
In your opinion, do these tests make sense?
If so, I would be very happy, because I could be confident that my good (98%)
result is genuine and not a biased one…
Thank you again,
Fabrizio
CODE:
# 1) true labels
import numpy as np
from sklearn import svm, cross_validation

niter = 10000
scores = np.zeros(niter)
clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
for rs in range(niter):
    # draw a fresh random 75/25 train/test split on every iteration
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X_scaled, y, test_size=0.25, random_state=rs + 1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # float arithmetic so the percentage is not truncated under Python 2
    scores[rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# 2) false labels obtained by permuting the true ones
nperm = 100   # number of label permutations
niter = 100   # number of train/test splits per permutation
scores = np.zeros((nperm, niter))
# permuting the labels should destroy any real signal, so accuracy
# should drop to chance level (about 50% for two balanced classes)
print("\nWhen a manual label permutation procedure is applied:")
for i in range(nperm):
    yfalse = np.random.permutation(y)
    clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
    for rs in range(niter):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            X_scaled, yfalse, test_size=0.25, random_state=rs + 1)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        scores[i, rs] = 100.0 * np.sum(y_test == y_pred) / y_test.shape[0]
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
> On 27 Apr 2015, at 22:27, Andreas Mueller <t3k...@gmail.com> wrote:
>
> You changed the labels only once, and have a test-set size of 4? I would
> imagine that is where that comes from.
> If you repeat over different assignments, you will get 50/50.
>
> On 04/27/2015 11:33 AM, Fabrizio Fasano wrote:
>> Dear Andy,
>>
>> Yes, the classes have the same size, 8 and 8
>>
>> this is one example of the code I used to cross-validate the classification
>> (here I used StratifiedShuffleSplit, but I also tried other methods such as
>> leave-one-out and simple 4-fold cross-validation, and the result didn't change much)
>>
>> from sklearn.cross_validation import StratifiedShuffleSplit
>> sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
>> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
>>
>> cv_scores=[]
>> for train_index, test_index in sss:
>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>     clf.fit(X_train, y_train)
>>     y_pred = clf.predict(X_test)
>>     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
>>
>> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))
>>
>>
>>
>>
>> On Apr 26, 2015, at 7:50 PM, Andy wrote:
>>
>>> Your expectation is right, if you randomly assign labels, you shouldn't
>>> get more than 50% correct with a large enough dataset.
>>> I imagine there is some issue in how you shuffled the labels. Without
>>> the code, it is hard to tell.
>>> Are you sure the classes have the same size?
>>>
>>> On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:
>>>> Dear Andreas,
>>>>
>>>> Thanks a lot for your help,
>>>>
>>>> About the random assignment of values to my labels y: being suspicious of
>>>> the too-good performance, I changed the labels manually, keeping the 50/50
>>>> split of 1s and 0s but in different orders, and the labels were still
>>>> predicted very well, with accuracy never lower than 60%. By chance I
>>>> expected values below 50% as well as above 50%. I didn't perform an
>>>> exhaustive test (I only did it manually for a few combinations)...
>>>>
>>>> Fabrizio