Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
list, but figured one at a time :)
The scikit-learn LogisticRegression class uses one-vs-all in a
multiclass setting, although I also tried wrapping it in their
one-vs-one metaclassifier and got similarly "weird" results.
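(For reference, the one-vs-one attempt looked roughly like this -- just a
sketch from memory, so treat the exact incantation as approximate:)

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Scaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier

    # fit one LogisticRegression per pair of classes and vote, instead
    # of relying on LogisticRegression's built-in one-vs-all scheme
    ovo = OneVsOneClassifier(LogisticRegression())
    pipe = Pipeline([("scale", Scaler()), ("classify", ovo)])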
Interestingly, though, I think the multiclass setting is a red
herring. For this dataset we also have a two-class condition (you can
think of the paradigm as a 3x2 design, although we're analyzing them
separately), which has the same thing happening:
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
cv = LeaveOneLabelOut(labels)
print cross_val_score(pipe, X, y, cv=cv).mean()
for train, test in cv:
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
0.496377606851
[ 0 68]
[ 0 70]
[ 0 67]
[ 0 69]
cv = LeaveOneLabelOut(np.random.permutation(labels))
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
print cross_val_score(pipe, X, y, cv=cv).mean()
for train, test in cv:
    print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
0.532455733754
[40 28]
[36 34]
[33 34]
[31 38]
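(One more check I can run, in case each run has its own overall offset
that leave-one-run-out exposes -- just a sketch, with `labels` being the
per-sample run vector as above:)

    import numpy as np

    # per-run feature statistics: a run-specific shift in mean or scale
    # would explain why held-out runs behave so differently from
    # shuffled folds
    for run in np.unique(labels):
        run_X = X[labels == run]
        print run, run_X.mean(), run_X.std()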
Best,
Michael
On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko <[email protected]> wrote:
> just to educate myself -- how does sklearn make multiclass decisions in
> this case? if it is all-pairs classification + voting, then the answer is
> simple -- ties, and the "first one in order" would take all of those.
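>
> (a toy illustration of that tie-breaking -- not sklearn's actual code:
> with 3 classes each winning one of the three pairwise contests, a plain
> argmax over the vote counts silently hands every tie to the first class)
>
>     import numpy as np
>     votes = np.array([1, 1, 1])  # 3-way tie: each class wins one pairwise contest
>     print np.argmax(votes)       # -> 0, so the first class takes all ties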
>
> but if there are no ties involved, then theoretically (not sure whether
> this applies to your data) it is easy to come up with non-linear scenarios
> for binary classification where one class is classified better than the
> other by a linear classifier... e.g. here is an example (sorry -- pymvpa)
> with an embedded normal (i.e. both classes have their mean at the same
> spot but significantly different variances):
>
> from mvpa2.suite import *
>
> ns, nf = 100, 10
> ds = dataset_wizard(
>     np.vstack((
>         np.random.normal(size=(ns, nf)),
>         10 * np.random.normal(size=(ns, nf)))),
>     targets=['narrow'] * ns + ['wide'] * ns,
>     chunks=[0, 1] * ns)
> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>                      enable_ca=['stats'])
> cv(ds).samples
> print cv.ca.stats
>
> yields
>
> ----------.
> predictions\targets  narrow   wide
>             `------  ------  ------   P'   N'  FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>              narrow     100      74  174   26  74   0  0.57     1     1  0.26  0.43  0.39  0.41
>                wide       0      26   26  174   0  74     1  0.57  0.26     1     0  0.39  0.41
> Per target:          ------  ------
>                   P     100     100
>                   N     100     100
>                  TP     100      26
>                  TN      26     100
> Summary \ Means:     ------  ------  100  100  37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
> CHI^2 123.04 p=1.7e-26
> ACC 0.63
> ACC% 63
> # of sets 2
>
>
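> (and roughly the same demo in sklearn terms, if that is easier to play
> with -- an untested sketch using a linear SVM:
>
>     import numpy as np
>     from sklearn.svm import LinearSVC
>     from sklearn.cross_validation import cross_val_score
>
>     ns, nf = 100, 10
>     X = np.vstack((np.random.normal(size=(ns, nf)),        # narrow class
>                    10 * np.random.normal(size=(ns, nf))))  # wide class
>     y = np.array([0] * ns + [1] * ns)
>     # no linear boundary separates a class embedded inside another,
>     # so predictions pile onto one class while accuracy stays above chance
>     print cross_val_score(LinearSVC(), X, y, cv=2).mean()
> )
>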
> I bet that with a bit of creativity, classifier-dependent cases of
> similar behavior could be found even for linear underlying models.
>
> Cheers,
>
> On Sat, 28 Jan 2012, Michael Waskom wrote:
>
>> Hi Folks,
>
>> I hope you don't mind a question that's a mix of general machine
>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>> perspective so I thought it best to ask here first.
>
>> I'm doing classification of fMRI data using logistic regression. I've
>> been playing around with things for the past couple days and was
>> getting accuracies right around or slightly above chance, which was
>> disappointing.
>> Initially, my code looked a bit like this:
>
>> pipeline = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>> cv = LeaveOneLabelOut(labels)
>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>> print acc
>
>> 0.358599857854
>
>> labels is an array of ints in [1, 4] specifying which fMRI run each
>> sample came from, and y has three classes.
>
>> When I went to inspect the predictions being made, though, I realized
>> that in each split one class was almost completely dominating:
>
>> cv = LeaveOneLabelOut(labels)
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>
>> [58 0 11]
>> [67 0 3]
>> [ 0 70 0]
>> [ 0 67 0]
>
>> Which doesn't seem right at all. I realized that if I disregard the
>> labels and just run 5-fold cross validation, though, the balance of
>> predictions looks much more like what I would expect:
>
>> cv = KFold(len(y), 5)
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()),
>>                      ("classify", LogisticRegression())])
>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>
>> [22 16 17]
>> [25 14 16]
>> [17 25 13]
>> [36 6 13]
>> [37 9 10]
>
>> (Although note that the first class still dominates somewhat.) When I
>> go back and run the full analysis this way, I get accuracies more in
>> line with what I would have expected from previous fMRI studies in
>> this domain.
>
>> My design is slow event-related, so my samples should be independent,
>> at least as far as HRF blurring is concerned.
>
>> I'm not considering error trials, so the number of samples in each
>> class is not perfectly balanced, but participants are near ceiling and
>> thus the counts are very close:
>
>> cv = LeaveOneLabelOut(labels)
>> for train, test in cv:
>>     print histogram(y[train], 3)[0]
>
>> [71 67 69]
>> [71 68 67]
>> [70 69 67]
>> [70 69 70]
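>>
>> (Sanity check on the numbers: a classifier that always predicted a
>> single class would score about 71/207 = 0.343 on a held-out run, which
>> is right in the neighborhood of the 0.359 I'm actually getting:
>>
>>     counts = [71, 67, 69]  # class counts from the first split
>>     print max(counts) / float(sum(counts))   # ~0.343
>> )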
>
>
>> Apologies for the long explanation. Two questions, really:
>
>> 1) Does it look like I'm doing anything obviously wrong?
>
>> 2) If not, can you help me build some intuition about why this is
>> happening and what it means? Or suggest things I could look at in my
>> data/code to identify the source of the problem?
>
>> I really appreciate it! Aside from this befuddling issue, I've found
>> scikit-learn an absolute delight to use!
>
>> Best,
>> Michael
> --
> =------------------------------------------------------------------=
> Keep in touch www.onerussian.com
> Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic
>