Hi,

Just a thought: you seem to be doing inter-subject prediction. In that case
a 5-fold CV mixes subjects, which hints that you may have a subject effect
acting as a confound.
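
If so, a subject-aware split would keep each subject's samples on one side
of the fold. A minimal sketch (untested, reusing your pipe/X/y from below;
the per-sample `subjects` array is an assumption, not from your code):

    from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score
    # hypothetical: subjects[i] identifies which subject sample i came from
    cv = LeaveOneLabelOut(subjects)  # train and test never share a subject
    print cross_val_score(pipe, X, y, cv=cv).mean()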

Again, just a thought -- I only read the email quickly.

Alex

On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
> list, but figured one at a time :)
>
> The scikit-learn LogisticRegression class uses one-vs-all in a
> multiclass setting, although I also tried it with their one-vs-one
> metaclassifier with similar "weird" results.
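>
> (For the record, that attempt looked roughly like this -- a sketch from
> memory, not verbatim from my script:)
>
> from sklearn.multiclass import OneVsOneClassifier
> ovo = OneVsOneClassifier(LogisticRegression())
> pipe = Pipeline([("scale", Scaler()), ("classify", ovo)])
> print cross_val_score(pipe, X, y, cv=cv).mean()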
>
> Interestingly, though, I think the multiclass setting is a red
> herring.  For this dataset we also have a two-class condition (you can
> think of the paradigm as a 3x2 design, although we're analyzing them
> separately), which has the same thing happening:
>
> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
> cv = LeaveOneLabelOut(labels)
> print cross_val_score(pipe, X, y, cv=cv).mean()
> for train, test in cv:
>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>
> 0.496377606851
> [ 0 68]
> [ 0 70]
> [ 0 67]
> [ 0 69]
>
> cv = LeaveOneLabelOut(np.random.permutation(labels))
> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
> print cross_val_score(pipe, X, y, cv=cv).mean()
> for train, test in cv:
>   print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>
> 0.532455733754
> [40 28]
> [36 34]
> [33 34]
> [31 38]
>
> Best,
> Michael
>
> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
>> just to educate myself -- how does sklearn make multiclass decisions in
>> this case?  if it is all-pairs classification + voting, then the answer is
>> simple -- ties, and the "first one in order" would take all of those.
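>>
>> (a toy illustration of that tie-breaking, not sklearn's actual code: with
>> an argmax-style readout of the vote counts, the first class listed wins
>> any tie)
>>
>>    import numpy as np
>>    votes = np.array([2, 2, 1])  # classes 0 and 1 tie on votes
>>    print np.argmax(votes)       # -> 0: the first maximal entry wins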
>>
>> but if there are no ties involved, then theoretically (not sure if this
>> is applicable to your data) it is easy to come up with non-linear
>> scenarios for binary classification where one class would be better
>> classified than the other by a linear classifier...  e.g. here is an
>> example (sorry -- pymvpa) with an embedded normal (i.e. both classes have
>> their means at the same spot but significantly different variances):
>>
>>    from mvpa2.suite import *
>>    ns, nf = 100, 10
>>    # 'narrow' class: unit variance; 'wide' class: 10x the standard
>>    # deviation; both centered at the origin
>>    ds = dataset_wizard(
>>        np.vstack((
>>            np.random.normal(size=(ns, nf)),
>>            10*np.random.normal(size=(ns, nf)))),
>>        targets=['narrow']*ns + ['wide']*ns,
>>        chunks=[0, 1]*ns)
>>    # cross-validate a linear SVM across the two chunks, keeping the
>>    # confusion-matrix statistics
>>    cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>                         enable_ca=['stats'])
>>    cv(ds).samples
>>    print cv.ca.stats
>>
>> yields
>>
>>    ----------.
>>    predictions\targets  narrow   wide
>>                `------  ------  ------   P'   N'  FP  FN  PPV  NPV  TPR   SPC  FDR  MCC  AUC
>>           narrow         100      74    174   26  74   0  0.57    1    1  0.26 0.43 0.39 0.41
>>            wide           0       26     26  174   0  74     1 0.57 0.26     1    0 0.39 0.41
>>    Per target:          ------  ------
>>             P            100     100
>>             N            100     100
>>             TP           100      26
>>             TN            26     100
>>    Summary \ Means:     ------  ------  100  100  37  37  0.79 0.79 0.63  0.63 0.21 0.39 0.41
>>           CHI^2         123.04 p=1.7e-26
>>            ACC           0.63
>>            ACC%           63
>>         # of sets         2
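>>
>> for comparison, a rough scikit-learn rendition of the same toy (an
>> untested sketch with the same embedded-normal setup, not anyone's actual
>> pipeline):
>>
>>    import numpy as np
>>    from sklearn.svm import LinearSVC
>>    from sklearn.metrics import confusion_matrix
>>    ns, nf = 100, 10
>>    X = np.vstack((np.random.normal(size=(ns, nf)),      # 'narrow'
>>                   10*np.random.normal(size=(ns, nf))))  # 'wide'
>>    y = np.array([0]*ns + [1]*ns)
>>    half = np.tile([True, False], ns)  # alternate samples into two chunks
>>    clf = LinearSVC().fit(X[half], y[half])
>>    print confusion_matrix(y[~half], clf.predict(X[~half]))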
>>
>>
>> I bet that with a bit of creativity, similar classifier-dependent cases
>> could be found for linear underlying models.
>>
>> Cheers,
>>
>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>
>>> Hi Folks,
>>
>>> I hope you don't mind a question that's a mix of general machine
>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>> perspective, so I thought it best to ask here first.
>>
>>> I'm doing classification of fMRI data using logistic regression.  I've
>>> been playing around with things for the past couple of days and was
>>> getting accuracies right around or slightly above chance, which was
>>> disappointing.
>>> Initially, my code looked a bit like this:
>>
>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>> cv = LeaveOneLabelOut(labels)
>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>> print acc
>>
>>> 0.358599857854
>>
>>> Labels are an int in [1, 4] specifying which fmri run each sample came
>>> from, and y has three classes.
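>>
>>> (So each fold holds out exactly one run.  On a toy version of those
>>> labels -- just an illustration, not my real data:)
>>
>>> labels = np.array([1, 1, 2, 2, 3, 3, 4, 4])
>>> for train, test in LeaveOneLabelOut(labels):
>>>     print labels[test]
>>
>>> [1 1]
>>> [2 2]
>>> [3 3]
>>> [4 4]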
>>
>>> When I went to inspect the predictions being made, though, I realized
>>> that in each split one class almost completely dominated:
>>
>>> cv = LeaveOneLabelOut(labels)
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>
>>> [58  0 11]
>>> [67  0  3]
>>> [ 0 70  0]
>>> [ 0 67  0]
>>
>>> That doesn't seem right at all.  I realized, though, that if I disregard
>>> the run labels and just do 5-fold cross-validation, the balance of
>>> predictions looks much more like what I would expect:
>>
>>> cv = KFold(len(y), 5)
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>
>>> [22 16 17]
>>> [25 14 16]
>>> [17 25 13]
>>> [36  6 13]
>>> [37  9 10]
>>
>>> (Although note that the first class still relatively dominates.)  When
>>> I go back and run the full analysis this way, I get accuracies more in
>>> line with what I would have expected from previous fMRI studies in
>>> this domain.
>>
>>> My design is slow event-related, so my samples should be independent, at
>>> least as far as HRF blurring is concerned.
>>
>>> I'm not considering error trials, so the number of samples per class is
>>> not perfectly balanced, but participants perform near ceiling, so the
>>> counts are very close:
>>
>>> cv = LeaveOneLabelOut(labels)
>>> for train, test in cv:
>>>     print histogram(y[train], 3)[0]
>>
>>> [71 67 69]
>>> [71 68 67]
>>> [70 69 67]
>>> [70 69 70]
>>
>>
>>> Apologies for the long explanation.  Two questions, really:
>>
>>> 1) Does it look like I'm doing anything obviously wrong?
>>
>>> 2) If not, can you help me build some intuition about why this is
>>> happening and what it means? Or suggest things I could look at in my
>>> data/code to identify the source of the problem?
>>
>>> I really appreciate it!  Aside from this befuddling issue, I've found
>>> scikit-learn an absolute delight to use!
>>
>>> Best,
>>> Michael
>> --
>> =------------------------------------------------------------------=
>> Keep in touch                                     www.onerussian.com
>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
