Hi Folks,
I hope you don't mind a question that's a mix of general machine
learning and scikit-learn. I'm happy to kick it over to metaoptimize,
but I'm not 100% sure I'm doing everything "right" from a scikit-learn
perspective so I thought it best to ask here first.
I'm doing classification of fMRI data using logistic regression. I've
been playing around with things for the past couple days and was
getting accuracies right around or slightly above chance, which was
disappointing.
Initially, my code looked a bit like this:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import LeaveOneLabelOut, KFold, cross_val_score

pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
cv = LeaveOneLabelOut(labels)
acc = cross_val_score(pipeline, X, y, cv=cv).mean()
print acc
0.358599857854
labels is an array of ints in [1, 4] specifying which fMRI run each sample came
from, and y has three classes.
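For concreteness, the arrays are shaped roughly like this (made-up numbers
purely for illustration; the real labels, y, and X come out of my
preprocessing, and n_per_run / the voxel count are just stand-ins):

import numpy as np

n_per_run = 69                               # stand-in for my actual run length
labels = np.repeat([1, 2, 3, 4], n_per_run)  # which run each sample came from
y = np.arange(4 * n_per_run) % 3             # placeholder ordering, three classes
X = np.random.randn(len(y), 500)             # n_samples x n_voxels feature matrix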
When I went to inspect the predictions being made, though, I realized
in each split one class was almost completely dominating:
from numpy import histogram

cv = LeaveOneLabelOut(labels)
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
[58 0 11]
[67 0 3]
[ 0 70 0]
[ 0 67 0]
Which doesn't seem right at all. I realized that if I disregard the
labels and just run 5-fold cross validation, though, the balance of
predictions looks much more like what I would expect:
cv = KFold(len(y), 5)
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
[22 16 17]
[25 14 16]
[17 25 13]
[36 6 13]
[37 9 10]
(Although note that the first class is still relatively dominant.) When I go
back and run the full analysis this way, I get accuracies more in line with
what I would have expected from previous fMRI studies in this domain.
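Concretely, the rerun just swaps out the CV object; something like:

pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
cv = KFold(len(y), 5)  # plain 5-fold CV, ignoring run membership
acc = cross_val_score(pipeline, X, y, cv=cv).mean()
print acc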
My design is slow event-related, so my samples should be independent, at least
as far as HRF blurring is concerned.
I'm excluding error trials, so the number of samples per class is not
perfectly balanced, but participants perform near ceiling, so the counts are
very close:
cv = LeaveOneLabelOut(labels)
for train, test in cv:
    print histogram(y[train], 3)[0]
[71 67 69]
[71 68 67]
[70 69 67]
[70 69 70]
Apologies for the long explanation. Two questions, really:
1) Does it look like I'm doing anything obviously wrong?
2) If not, can you help me build some intuition about why this is
happening and what it means? Or suggest things I could look at in my
data/code to identify the source of the problem?
I really appreciate it! Aside from this befuddling issue, I've found
scikit-learn an absolute delight to use!
Best,
Michael