Re: [Scikit-learn-general] Causes for one class dominating?

Michael Waskom Sun, 29 Jan 2012 08:00:24 -0800

Hi Alex,

No, each subject has four runs, so I'm doing leave-one-run-out cross
validation in the original case. I'm estimating separate models within
each subject (as is common in fMRI), so all my example code here would
sit inside a for subject in subjects: loop. This pattern of weirdness
is happening in every subject I've looked at so far.
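
For anyone following along with a current scikit-learn release, here is a
minimal sketch of the leave-one-run-out check discussed in this thread. It
assumes the newer names StandardScaler and LeaveOneGroupOut, which replaced
the Scaler and LeaveOneLabelOut used in the quoted code. The arrays X, y and
runs are random stand-ins and prediction_counts is a hypothetical helper, so
this only shows the mechanics of counting predicted classes per held-out run,
with the true and with shuffled run labels. Random data should not be expected
to reproduce the imbalance reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 4 runs x 69 samples, 50 features, 3 classes.
rng = np.random.default_rng(0)
n_runs, n_per_run, n_features, n_classes = 4, 69, 50, 3
X = rng.normal(size=(n_runs * n_per_run, n_features))
y = rng.integers(0, n_classes, size=n_runs * n_per_run)
runs = np.repeat(np.arange(1, n_runs + 1), n_per_run)


def prediction_counts(X, y, groups, n_classes):
    """Return mean CV accuracy and, per held-out group, predicted-class counts."""
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = LeaveOneGroupOut()
    acc = cross_val_score(pipe, X, y, groups=groups, cv=cv).mean()
    counts = []
    for train, test in cv.split(X, y, groups):
        # Refit on the training runs, then tally predictions on the held-out run.
        pred = pipe.fit(X[train], y[train]).predict(X[test])
        counts.append(np.bincount(pred, minlength=n_classes))
    return acc, np.array(counts)


# Leave-one-run-out with the true run labels.
acc, counts = prediction_counts(X, y, runs, n_classes)
print(acc)
print(counts)

# Same check with the run labels shuffled, as in the quoted permutation test.
acc_perm, counts_perm = prediction_counts(X, y, rng.permutation(runs), n_classes)
print(acc_perm)
print(counts_perm)
```

Each row of counts corresponds to one held-out run; a row such as [0 70 0]
would be the kind of single-class dominance described in the quoted messages.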


Michael

On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
<[email protected]> wrote:
> hi,
>
> just a thought. You seem to be doing inter-subject prediction. In this case
> a 5 fold mixes subjects. A hint is that you may have a subject effect that
> acts as a confound.
>
> again, just a thought; I read the email quickly
>
> Alex
>
> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>> list, but figured one at a time :)
>>
>> The scikit-learn LogisticRegression class uses one-vs-all in a
>> multiclass setting, although I also tried it with their one-vs-one
>> metaclassifier with similar "weird" results.
>>
>> Interestingly, though, I think the multiclass setting is a red
>> herring.  For this dataset we also have a two-class condition (you can
>> think of the paradigm as a 3x2 design, although we're analyzing them
>> separately), which has the same thing happening:
>>
>> cv = LeaveOneLabelOut(labels)
>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>> print cross_val_score(pipe, X, y, cv=cv).mean()
>> for train, test in cv:
>>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>
>> 0.496377606851
>> [ 0 68]
>> [ 0 70]
>> [ 0 67]
>> [ 0 69]
>>
>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>> print cross_val_score(pipe, X, y, cv=cv).mean()
>> for train, test in cv:
>>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>
>> 0.532455733754
>> [40 28]
>> [36 34]
>> [33 34]
>> [31 38]
>>
>> Best,
>> Michael
>>
>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
>>> just to educate myself -- how does sklearn make multiclass decisions in
>>> this case?  if it is all-pairs classification + voting, then the answer is
>>> simple -- ties, and the "first one in order" would take all those.
>>>
>>> but if there are no ties involved then, theoretically (since I'm not sure
>>> it is applicable to your data), it is easy to come up with non-linear
>>> scenarios for binary classification where one class would be classified
>>> better than the other with a linear classifier...  e.g. here is an example
>>> (sorry -- pymvpa) with an embedded normal (i.e. both classes have their
>>> mean at the same spot but have significantly different variances)
>>>
>>>    from mvpa2.suite import *
>>>    ns, nf = 100, 10
>>>    ds = dataset_wizard(
>>>        np.vstack((
>>>            np.random.normal(size=(ns, nf)),
>>>            10*np.random.normal(size=(ns, nf)))),
>>>        targets=['narrow']*ns + ['wide']*ns,
>>>        chunks=[0,1]*ns)
>>>    cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>                         enable_ca=['stats'])
>>>    cv(ds).samples
>>>    print cv.ca.stats
>>>
>>> yields
>>>
>>>    ----------.
>>>    predictions\targets  narrow   wide
>>>                `------  ------  ------  P'  N' FP FN  PPV  NPV  TPR  SPC  FDR  MCC  AUC
>>>           narrow         100      74   174  26 74  0 0.57   1    1  0.26 0.43 0.39 0.41
>>>            wide           0       26    26 174  0 74   1  0.57 0.26   1    0  0.39 0.41
>>>    Per target:          ------  ------
>>>             P            100     100
>>>             N            100     100
>>>             TP           100      26
>>>             TN            26     100
>>>    Summary \ Means:     ------  ------ 100 100 37 37 0.79 0.79 0.63 0.63 0.21 0.39 0.41
>>>           CHI^2         123.04 p=1.7e-26
>>>            ACC           0.63
>>>            ACC%           63
>>>         # of sets         2
>>>
>>>
>>> I bet that, with a bit of creativity, similar classifier-dependent
>>> cases could be found for linear underlying models.
>>>
>>> Cheers,
>>>
>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>
>>>> Hi Folks,
>>>
>>>> I hope you don't mind a question that's a mix of general machine
>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>> perspective so I thought it best to ask here first.
>>>
>>>> I'm doing classification of fMRI data using logistic regression.  I've
>>>> been playing around with things for the past couple days and was
>>>> getting accuracies right around or slightly above chance, which was
>>>> disappointing.
>>>> Initially, my code looked a bit like this:
>>>
>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>> cv = LeaveOneLabelOut(labels)
>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>> print acc
>>>
>>>> 0.358599857854
>>>
>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>> from, and y has three classes.
>>>
>>>> When I went to inspect the predictions being made, though, I realized
>>>> in each split one class was almost completely dominating:
>>>
>>>> cv = LeaveOneLabelOut(labels)
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>>> [58  0 11]
>>>> [67  0  3]
>>>> [ 0 70  0]
>>>> [ 0 67  0]
>>>
>>>> Which doesn't seem right at all.  I realized that if I disregard the
>>>> labels and just run 5-fold cross validation, though, the balance of
>>>> predictions looks much more like what I would expect:
>>>
>>>> cv = KFold(len(y), 5)
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>
>>>> [22 16 17]
>>>> [25 14 16]
>>>> [17 25 13]
>>>> [36  6 13]
>>>> [37  9 10]
>>>
>>>> (Although note the still relative dominance of the first class).  When
>>>> I go back and run the full analysis this way, I get accuracies more in
>>>> line with what I would have expected from previous fMRI studies in
>>>> this domain.
>>>
>>>> My design is slow event-related, so my samples should be independent
>>>> at least as far as HRF-blurring is considered.
>>>
>>>> I'm not considering error trials so the number of samples for each
>>>> class is not perfectly balanced, but participants are near ceiling and
>>>> thus they are very close:
>>>
>>>> cv = LeaveOneLabelOut(labels)
>>>> for train, test in cv:
>>>>     print histogram(y[train], 3)[0]
>>>
>>>> [71 67 69]
>>>> [71 68 67]
>>>> [70 69 67]
>>>> [70 69 70]
>>>
>>>
>>>> Apologies for the long explanation.  Two questions, really:
>>>
>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>
>>>> 2) If not, can you help me build some intuition about why this is
>>>> happening and what it means? Or suggest things I could look at in my
>>>> data/code to identify the source of the problem?
>>>
>>>> I really appreciate it!  Aside from this befuddling issue, I've found
>>>> scikit-learn an absolute delight to use!
>>>
>>>> Best,
>>>> Michael
>>> --
>>> =------------------------------------------------------------------=
>>> Keep in touch                                     www.onerussian.com
>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
>>>
>>> ------------------------------------------------------------------------------
>>> Try before you buy = See our experts in action!
>>> The most comprehensive online learning library for Microsoft developers
>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>> http://p.sf.net/sfu/learndevnow-dev2
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>

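The embedded-normal example quoted above (two classes with the same mean but
very different variances) can also be sketched in scikit-learn. This is a
rough analogue under stated assumptions, not a port of the PyMVPA code:
LogisticRegression is swapped in for LinearCSVMC, LeaveOneGroupOut over the
two chunks stands in for NFoldPartitioner, and the seed is arbitrary. Which
class ends up favoured, and by how much, depends on the classifier and the
sample, so no particular output is claimed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Same construction as the quoted example: equal means, 10x wider second class.
rng = np.random.default_rng(0)
ns, nf = 100, 10
X = np.vstack([rng.normal(size=(ns, nf)), 10 * rng.normal(size=(ns, nf))])
y = np.array(["narrow"] * ns + ["wide"] * ns)
chunks = np.tile([0, 1], ns)  # alternating chunk ids, as in chunks=[0,1]*ns

# Assumption: a linear logistic model stands in for the linear SVM.
clf = LogisticRegression(max_iter=1000)
pred = cross_val_predict(clf, X, y, groups=chunks, cv=LeaveOneGroupOut())

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, pred, labels=["narrow", "wide"])
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_recall)
```

A large gap between the two entries of per_class_recall would be the kind of
one-sided behaviour the PyMVPA confusion matrix shows; comparing it with
overall accuracy is a quick way to spot a linear model that is effectively
predicting one class.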
