Hi Alex,
See my response to Yarick for some results from a binary classification.
I did rerun both the three-way and the binary classification with SVC,
though, and got similar results:
cv = LeaveOneLabelOut(bin_labels)
pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
    print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]

0.496377606851
[ 0 68]
[ 0 70]
[ 0 67]
[ 0 69]
cv = LeaveOneLabelOut(tri_labels)
pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
    print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]

0.386755821732
[20  0 48]
[29  1 40]
[ 2  0 65]
[ 0 69  0]
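
For your second suggestion (keeping two classes at a time), this is roughly
the check I have in mind. Just a sketch, untested, reusing the tri_* arrays
and the same pipeline as above:

import numpy as np
from itertools import combinations

# for each pair of the three classes, restrict the samples to those two
# classes and rerun the same leave-one-run-out evaluation
for class_a, class_b in combinations(np.unique(tri_y), 2):
    mask = np.logical_or(tri_y == class_a, tri_y == class_b)
    cv = LeaveOneLabelOut(tri_labels[mask])
    pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
    score = cross_val_score(pipe, tri_X[mask], tri_y[mask], cv=cv).mean()
    print class_a, "vs", class_b, ":", score
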
On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
<[email protected]> wrote:
> ok
>
> some more suggestions:
>
> - do you observe the same behavior with SVC, which uses a different
>   multiclass strategy?
> - what do you see when you inspect results obtained with binary
> predictions (keeping 2 classes at a time)?
>
> Alex
>
> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>> Hi Alex,
>>
>> No, each subject has four runs, so I'm doing leave-one-run-out cross
>> validation in the original case. I'm estimating separate models within
>> each subject (as is common in fMRI), so all my example code here would
>> run inside a "for subject in subjects:" loop, but this pattern of
>> weirdness is happening in every subject I've looked at so far.
>>
>> Michael
>>
>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>> <[email protected]> wrote:
>>> hi,
>>>
>>> just a thought. You seem to be doing inter-subject prediction. In that
>>> case a 5-fold split mixes subjects within folds, so you may have a
>>> subject effect that acts as a confound.
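>>>
>>> Concretely, something like this sketch would hold out whole subjects
>>> instead of mixing them across folds (subject_labels here is a hypothetical
>>> array giving each sample's subject):
>>>
>>> # group the folds by subject so a subject-specific signal cannot leak
>>> # from the training set into the test set
>>> cv = LeaveOneLabelOut(subject_labels)
>>> print cross_val_score(pipe, X, y, cv=cv).mean()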
>>>
>>> again, just a thought; I read the email quickly
>>>
>>> Alex
>>>
>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]>
>>> wrote:
>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>> list, but figured one at a time :)
>>>>
>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>> multiclass setting, although I also tried it with their one-vs-one
>>>> metaclassifier with similar "weird" results.
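>>>>
>>>> (For reference, the one-vs-one attempt was along these lines. Just a
>>>> sketch, assuming the OneVsOneClassifier wrapper from sklearn.multiclass:)
>>>>
>>>> from sklearn.multiclass import OneVsOneClassifier
>>>> # fit one binary logistic regression per pair of classes and let them
>>>> # vote at prediction time, instead of the default one-vs-all scheme
>>>> pipe = Pipeline([("scale", Scaler()),
>>>>                  ("classify", OneVsOneClassifier(LogisticRegression()))])
>>>> print cross_val_score(pipe, X, y, cv=LeaveOneLabelOut(labels)).mean()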
>>>>
>>>> Interestingly, though, I think the multiclass setting is a red
>>>> herring. For this dataset we also have a two-class condition (you can
>>>> think of the paradigm as a 3x2 design, although we're analyzing them
>>>> separately), which has the same thing happening:
>>>>
>>>> cv = LeaveOneLabelOut(labels)
>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>
>>>> 0.496377606851
>>>> [ 0 68]
>>>> [ 0 70]
>>>> [ 0 67]
>>>> [ 0 69]
>>>>
>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>> for train, test in cv:
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>
>>>> 0.532455733754
>>>> [40 28]
>>>> [36 34]
>>>> [33 34]
>>>> [31 38]
>>>>
>>>> Best,
>>>> Michael
>>>>
>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]>
>>>> wrote:
>>>>> just to educate myself -- how does sklearn make multiclass decisions in
>>>>> this case? If it is all-pairs classification + voting, then the answer is
>>>>> simple -- ties, and the "first one in order" would take all of those.
>>>>>
>>>>> But if there are no ties involved, then theoretically (not sure whether
>>>>> this applies to your data) it is easy to come up with non-linear scenarios
>>>>> for binary classification where one class is classified much better than
>>>>> the other by a linear classifier... e.g. here is an example (sorry --
>>>>> pymvpa) with an embedded normal (i.e. both classes have their mean at the
>>>>> same spot but significantly different variances):
>>>>>
>>>>> from mvpa2.suite import *
>>>>> ns, nf = 100, 10
>>>>> ds = dataset_wizard(
>>>>>     np.vstack((
>>>>>         np.random.normal(size=(ns, nf)),
>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>     chunks=[0, 1] * ns)
>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>                      enable_ca=['stats'])
>>>>> cv(ds).samples
>>>>> print cv.ca.stats
>>>>>
>>>>> yields
>>>>>
>>>>> ----------.
>>>>> predictions\targets  narrow   wide
>>>>>             `------  ------ ------   P'   N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC  AUC
>>>>>              narrow     100     74  174   26  74   0 0.57    1    1 0.26 0.43 0.39 0.41
>>>>>                wide       0     26   26  174   0  74    1 0.57 0.26    1    0 0.39 0.41
>>>>> Per target:          ------ ------
>>>>>                  P      100    100
>>>>>                  N      100    100
>>>>>                 TP      100     26
>>>>>                 TN       26    100
>>>>> Summary \ Means:     ------ ------  100  100  37  37 0.79 0.79 0.63 0.63 0.21 0.39 0.41
>>>>>        CHI^2        123.04 p=1.7e-26
>>>>>         ACC           0.63
>>>>>         ACC%            63
>>>>>   # of sets              2
>>>>>
>>>>>
>>>>> I bet that with a bit of creativity, similar classifier-dependent cases
>>>>> could be found even for linear underlying models.
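>>>>>
>>>>> (In case it is handier on this list, here is a rough scikit-learn version
>>>>> of the same toy scenario. Just a sketch, untested:)
>>>>>
>>>>> import numpy as np
>>>>> from sklearn.svm import SVC
>>>>>
>>>>> ns, nf = 100, 10
>>>>>
>>>>> # two classes with the same mean but very different variances
>>>>> # (0 = "narrow", 1 = "wide"); a linear boundary can only favor one side
>>>>> def make_xy():
>>>>>     X = np.vstack((np.random.normal(size=(ns, nf)),
>>>>>                    10 * np.random.normal(size=(ns, nf))))
>>>>>     y = np.array([0] * ns + [1] * ns)
>>>>>     return X, y
>>>>>
>>>>> X_train, y_train = make_xy()
>>>>> X_test, y_test = make_xy()
>>>>> pred = SVC(kernel="linear").fit(X_train, y_train).predict(X_test)
>>>>>
>>>>> # per-class accuracy; one class is typically classified much better
>>>>> # than the other
>>>>> for label in (0, 1):
>>>>>     print label, np.mean(pred[y_test == label] == label)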
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>
>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>>>> perspective so I thought it best to ask here first.
>>>>>
>>>>>> I'm doing classification of fMRI data using logistic regression. I've
>>>>>> been playing around with things for the past couple days and was
>>>>>> getting accuracies right around or slightly above chance, which was
>>>>>> disappointing.
>>>>>> Initially, my code looked a bit like this:
>>>>>
>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>> print acc
>>>>>
>>>>>> 0.358599857854
>>>>>
>>>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>>>> from, and y has three classes.
>>>>>
>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>> in each split one class was almost completely dominating:
>>>>>
>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>> for train, test in cv:
>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>
>>>>>> [58 0 11]
>>>>>> [67 0 3]
>>>>>> [ 0 70 0]
>>>>>> [ 0 67 0]
>>>>>
>>>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>>>> labels and just run 5-fold cross validation, though, the balance of
>>>>>> predictions looks much more like what I would expect:
>>>>>
>>>>>> cv = KFold(len(y), 5)
>>>>>> for train, test in cv:
>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>
>>>>>> [22 16 17]
>>>>>> [25 14 16]
>>>>>> [17 25 13]
>>>>>> [36 6 13]
>>>>>> [37 9 10]
>>>>>
>>>>>> (Although note that the first class is still relatively dominant.) When
>>>>>> I go back and run the full analysis this way, I get accuracies more in
>>>>>> line with what I would have expected from previous fMRI studies in
>>>>>> this domain.
>>>>>
>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>> at least as far as HRF blurring is concerned.
>>>>>
>>>>>> I'm not including error trials, so the number of samples for each
>>>>>> class is not perfectly balanced, but participants are near ceiling, so
>>>>>> the counts are very close:
>>>>>
>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>> for train, test in cv:
>>>>>>     print histogram(y[train], 3)[0]
>>>>>
>>>>>> [71 67 69]
>>>>>> [71 68 67]
>>>>>> [70 69 67]
>>>>>> [70 69 70]
>>>>>
>>>>>
>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>
>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>
>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>> happening and what it means? Or suggest things I could look at in my
>>>>>> data/code to identify the source of the problem?
>>>>>
>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>> scikit-learn an absolute delight to use!
>>>>>
>>>>>> Best,
>>>>>> Michael
>>>>> --
>>>>> =------------------------------------------------------------------=
>>>>> Keep in touch www.onerussian.com
>>>>> Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic
>>>>>