Just to educate myself -- how does sklearn make the multiclass decision in
this case? If it is all-pairs classification plus voting, then the answer is
simple: ties, and the class that comes "first in order" would take all of those.
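What I mean by the first class taking the ties: whether the per-class numbers
are one-vs-one vote counts or one-vs-rest decision values, the prediction
typically boils down to an argmax over them, and numpy's argmax returns the
first maximal index, so any exact tie goes to the class that comes earlier in
the ordering. A toy illustration (made-up scores, not sklearn internals):

import numpy as np

# made-up per-class scores for three samples over classes [0, 1, 2];
# the last two rows contain exact ties
scores = np.array([[ 0.2, -0.1,  0.1],
                   [-0.3,  0.4,  0.4],   # tie between class 1 and 2
                   [ 0.5, -0.2,  0.5]])  # tie between class 0 and 2

# argmax keeps the first maximal index, so ties resolve toward the
# earlier class
print(np.argmax(scores, axis=1))         # -> [0 1 0]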
But if there are no ties involved, then theoretically (not sure whether this
applies to your data) it is easy to come up with non-linear scenarios for
binary classification where one class is classified much better than the
other by a linear classifier. E.g. here is an example (sorry -- pymvpa) with
an embedded normal, i.e. both classes have their mean at the same spot but
significantly different variances:
from mvpa2.suite import *

ns, nf = 100, 10
# two zero-mean classes: "narrow" (std 1) vs "wide" (std 10)
ds = dataset_wizard(
    np.vstack((
        np.random.normal(size=(ns, nf)),
        10 * np.random.normal(size=(ns, nf)))),
    targets=['narrow'] * ns + ['wide'] * ns,
    chunks=[0, 1] * ns)

cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
                     enable_ca=['stats'])
cv(ds).samples
print cv.ca.stats
yields
----------.
predictions\targets narrow  wide
            `------ ------ ------   P'   N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC  AUC
             narrow    100     74  174   26  74   0 0.57    1    1 0.26 0.43 0.39 0.41
               wide      0     26   26  174   0  74    1 0.57 0.26    1    0 0.39 0.41
Per target:         ------ ------
                  P    100    100
                  N    100    100
                 TP    100     26
                 TN     26    100
Summary \ Means:    ------ ------  100  100  37  37 0.79 0.79 0.63 0.63 0.21 0.39 0.41
       CHI^2       123.04  p=1.7e-26
         ACC         0.63
        ACC%           63
   # of sets            2
I bet that with a bit of creativity, similar classifier-dependent cases could
be found even for linear underlying models.
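For the scikit-learn side, a rough equivalent of the demo above would be
something like this (just a sketch, not run on your data; LinearSVC standing
in for LinearCSVMC, class 0 = narrow, class 1 = wide, and a crude even/odd
split instead of the chunks):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

ns, nf = 100, 10
# both classes zero-mean: class 0 "narrow" (std 1), class 1 "wide" (std 10)
X = np.vstack((np.random.normal(size=(ns, nf)),
               10 * np.random.normal(size=(ns, nf))))
y = np.repeat([0, 1], ns)

# even/odd split standing in for the two chunks
train = np.arange(2 * ns) % 2 == 0
test = ~train

pred = LinearSVC().fit(X[train], y[train]).predict(X[test])
# expect most "wide" samples to get absorbed into "narrow",
# much like the confusion matrix above
print(confusion_matrix(y[test], pred))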
Cheers,
On Sat, 28 Jan 2012, Michael Waskom wrote:
> Hi Folks,
> I hope you don't mind a question that's a mix of general machine
> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
> perspective so I thought it best to ask here first.
> I'm doing classification of fMRI data using logistic regression. I've
> been playing around with things for the past couple days and was
> getting accuracies right around or slightly above chance, which was
> disappointing.
> Initially, my code looked a bit like this:
> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
> cv = LeaveOneLabelOut(labels)
> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
> print acc
> 0.358599857854
> Labels are an int in [1, 4] specifying which fmri run each sample came
> from, and y has three classes.
> When I went to inspect the predictions being made, though, I realized
> in each split one class was almost completely dominating:
> cv = LeaveOneLabelOut(labels)
> for train, test in cv:
>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
> [58 0 11]
> [67 0 3]
> [ 0 70 0]
> [ 0 67 0]
> Which doesn't seem right at all. I realized that if I disregard the
> labels and just run 5-fold cross validation, though, the balance of
> predictions looks much more like what I would expect:
> cv = KFold(len(y), 5)
> for train, test in cv:
>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
> [22 16 17]
> [25 14 16]
> [17 25 13]
> [36 6 13]
> [37 9 10]
> (Although note the still relative dominance of the first class). When
> I go back and run the full analysis this way, I get accuracies more in
> line with what I would have expected from previous fMRI studies in
> this domain.
> My design is slow event-related, so my samples should be independent
> at least as far as HRF-blurring is considered.
> I'm not considering error trials so the number of samples for each
> class is not perfectly balanced, but participants are near ceiling and
> thus they are very close:
> cv = LeaveOneLabelOut(labels)
> for train, test in cv:
>     print histogram(y[train], 3)[0]
> [71 67 69]
> [71 68 67]
> [70 69 67]
> [70 69 70]
> Apologies for the long explanation. Two questions, really:
> 1) Does it look like I'm doing anything obviously wrong?
> 2) If not, can you help me build some intuition about why this is
> happening and what it means? Or suggest things I could look at in my
> data/code to identify the source of the problem?
> I really appreciate it! Aside from this befuddling issue, I've found
> scikit-learn an absolute delight to use!
> Best,
> Michael
--
=------------------------------------------------------------------=
Keep in touch www.onerussian.com
Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic