Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
list, but figured one at a time :)
The scikit-learn LogisticRegression class uses one-vs-all in a
multiclass setting, although I also tried wrapping it in their
one-vs-one metaclassifier and got similarly "weird" results.
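(For reference, the one-vs-one attempt looked roughly like this -- just a
sketch from memory, so treat the exact incantation as approximate:)

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier

    # fit one LogisticRegression per pair of classes and vote, instead
    # of relying on LogisticRegression's built-in one-vs-all scheme
    ovo = OneVsOneClassifier(LogisticRegression())
    pipe = Pipeline([("scale", Scaler()), ("classify", ovo)])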
Interestingly, though, I think the multiclass setting is a red
herring. For this dataset we also have a two-class condition (you can
think of the paradigm as a 3x2 design, although we're analyzing them
separately), which has the same thing happening:
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
cv = LeaveOneLabelOut(labels)
print cross_val_score(pipe, X, y, cv=cv).mean()
for train, test in cv:
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
0.496377606851
[ 0 68]
[ 0 70]
[ 0 67]
[ 0 69]
cv = LeaveOneLabelOut(np.random.permutation(labels))
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
print cross_val_score(pipe, X, y, cv=cv).mean()
for train, test in cv:
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
0.532455733754
[40 28]
[36 34]
[33 34]
[31 38]
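(One more check I can run, in case each run has its own overall offset
that leave-one-run-out exposes -- just a sketch, with `labels` being the
per-sample run vector as above:)

    import numpy as np

    # per-run feature statistics: a run-specific shift in mean or scale
    # would explain why held-out runs behave so differently from
    # shuffled folds
    for run in np.unique(labels):
        run_X = X[labels == run]
        print run, run_X.mean(), run_X.std()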
Best,
Michael
On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
> just to educate myself -- how does sklearn make multiclass decisions in
> this case? if it is all-pairs classification + voting, then the answer is
> simple -- ties, and the "first one in order" would take all of those.
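>
> (a toy illustration of that tie-breaking -- not sklearn's actual code:
> with 3 classes each winning one of the three pairwise contests, a plain
> argmax over the vote counts silently hands every tie to the first class)
>
>     import numpy as np
>     votes = np.array([1, 1, 1])  # 3-way tie: each class wins one pairwise contest
>     print np.argmax(votes)       # -> 0, so the first class takes all ties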
>
> but if there are no ties involved, then theoretically (not sure whether
> this applies to your data) it is easy to come up with non-linear scenarios
> for binary classification where one class is classified better than the
> other by a linear classifier... e.g. here is an example (sorry -- pymvpa)
> with an embedded normal (i.e. both classes have their mean at the same
> spot but significantly different variances):
>
> from mvpa2.suite import *
>
> ns, nf = 100, 10
> ds = dataset_wizard(
>     np.vstack((
>         np.random.normal(size=(ns, nf)),
>         10 * np.random.normal(size=(ns, nf)))),
>     targets=['narrow'] * ns + ['wide'] * ns,
>     chunks=[0, 1] * ns)
> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>                      enable_ca=['stats'])
> cv(ds).samples
> print cv.ca.stats
>
> yields
>
> ----------.
> predictions\targets  narrow   wide
>             `------  ------  ------   P'   N'  FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>              narrow     100      74  174   26  74   0  0.57     1     1  0.26  0.43  0.39  0.41
>                wide       0      26   26  174   0  74     1  0.57  0.26     1     0  0.39  0.41
> Per target:          ------  ------
>                   P     100     100
>                   N     100     100
>                  TP     100      26
>                  TN      26     100
> Summary \ Means:     ------  ------  100  100  37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
> CHI^2 123.04 p=1.7e-26
> ACC 0.63
> ACC% 63
> # of sets 2
>
>
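> (and roughly the same demo in sklearn terms, if that is easier to play
> with -- an untested sketch using a linear SVM:
>
>     import numpy as np
>     from sklearn.svm import LinearSVC
>     from sklearn.cross_validation import cross_val_score
>
>     ns, nf = 100, 10
>     X = np.vstack((np.random.normal(size=(ns, nf)),        # narrow class
>                    10 * np.random.normal(size=(ns, nf))))  # wide class
>     y = np.array([0] * ns + [1] * ns)
>     # no linear boundary separates a class embedded inside another,
>     # so predictions pile onto one class while accuracy stays above chance
>     print cross_val_score(LinearSVC(), X, y, cv=2).mean()
> )
>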
> I bet that with a bit of creativity, classifier-dependent cases of
> similar behavior could be found even for linear underlying models.
>
> Cheers,
>
> On Sat, 28 Jan 2012, Michael Waskom wrote:
>
>> Hi Folks,
>
>> I hope you don't mind a question that's a mix of general machine
>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>> perspective so I thought it best to ask here first.
>
>> I'm doing classification of fMRI data using logistic regression. I've
>> been playing around with things for the past couple days and was
>> getting accuracies right around or slightly above chance, which was
>> disappointing.
>> Initially, my code looked a bit like this:
>
>> pipeline = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>> cv = LeaveOneLabelOut(labels)
>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>> print acc
>
>> 0.358599857854
>
>> labels is an array of ints in [1, 4] specifying which fMRI run each
>> sample came from, and y has three classes.
>
>> When I went to inspect the predictions being made, though, I realized
>> that in each split one class was almost completely dominating:
>
>> cv = LeaveOneLabelOut(labels)
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>
>> [58 0 11]
>> [67 0 3]
>> [ 0 70 0]
>> [ 0 67 0]
>
>> Which doesn't seem right at all. I realized that if I disregard the
>> labels and just run 5-fold cross validation, though, the balance of
>> predictions looks much more like what I would expect:
>
>> cv = KFold(len(y), 5)
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>
>> [22 16 17]
>> [25 14 16]
>> [17 25 13]
>> [36 6 13]
>> [37 9 10]
>
>> (Although note that the first class still dominates somewhat.) When I
>> go back and run the full analysis this way, I get accuracies more in
>> line with what I would have expected from previous fMRI studies in
>> this domain.
>
>> My design is slow event-related, so my samples should be independent,
>> at least as far as HRF blurring is concerned.
>
>> I'm not considering error trials, so the number of samples in each
>> class is not perfectly balanced, but participants are near ceiling and
>> thus the counts are very close:
>
>> cv = LeaveOneLabelOut(labels)
>> for train, test in cv:
>>     print histogram(y[train], 3)[0]
>
>> [71 67 69]
>> [71 68 67]
>> [70 69 67]
>> [70 69 70]
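>>
>> (Sanity check on the numbers: a classifier that always predicted a
>> single class would score about 71/207 = 0.343 on a held-out run, which
>> is right in the neighborhood of the 0.359 I'm actually getting:
>>
>>     counts = [71, 67, 69]  # class counts from the first split
>>     print max(counts) / float(sum(counts))   # ~0.343
>> )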
>
>
>> Apologies for the long explanation. Two questions, really:
>
>> 1) Does it look like I'm doing anything obviously wrong?
>
>> 2) If not, can you help me build some intuition about why this is
>> happening and what it means? Or suggest things I could look at in my
>> data/code to identify the source of the problem?
>
>> I really appreciate it! Aside from this befuddling issue, I've found
>> scikit-learn an absolute delight to use!
>
>> Best,
>> Michael
> --
> =------------------------------------------------------------------=
> Keep in touch www.onerussian.com
> Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic
>