Aha, this does indeed suggest something strange:
http://web.mit.edu/mwaskom/www/pca.png

I'm going to dig into this some more, but I don't really have any
strong intuitions to guide me here, so if anything pops out at you
from that, do feel free to speak up :)

Michael
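A quick sketch of how a projection like that can be made with
scikit-learn, in case anyone wants to reproduce it. This assumes the
X, y, and labels arrays from the snippets below; it is not the exact
code behind the plot:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import Scaler  # StandardScaler in later releases

# Project the scaled samples onto the first two principal components
# and color the points two ways: by class and by run. If the runs
# separate more cleanly than the classes, that points at a run-level
# confound.
X2 = PCA(n_components=2).fit_transform(Scaler().fit_transform(X))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X2[:, 0], X2[:, 1], c=y)
ax1.set_title("colored by class")
ax2.scatter(X2[:, 0], X2[:, 1], c=labels)
ax2.set_title("colored by run")
plt.show()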
On Sun, Jan 29, 2012 at 1:14 PM, Alexandre Gramfort
<[email protected]> wrote:
> hum...
>
> final suggestion: I would try to visualize a 2D or 3D PCA to see if it
> can give me some intuition on what's happening.
>
> Alex
>
> On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom <[email protected]> wrote:
>> Hi Alex,
>>
>> See my response to Yarick for some results from a binary
>> classification. I reran both the three-way and binary classification
>> with SVC, though, with similar results:
>>
>> cv = LeaveOneLabelOut(bin_labels)
>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>>
>> 0.496377606851
>> [ 0 68]
>> [ 0 70]
>> [ 0 67]
>> [ 0 69]
>>
>> cv = LeaveOneLabelOut(tri_labels)
>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>> for train, test in cv:
>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>>
>> 0.386755821732
>> [20  0 48]
>> [29  1 40]
>> [ 2  0 65]
>> [ 0 69  0]
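As an aside for readers: the per-fold histograms above only show the
marginal distribution of predictions. scikit-learn's confusion_matrix
gives the per-class breakdown, which makes this kind of collapse
easier to see. A minimal sketch, again assuming the X, y, and labels
arrays from the thread:

from sklearn.cross_validation import LeaveOneLabelOut
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler
from sklearn.svm import SVC

# Rows are true classes, columns are predicted classes, so a fold where
# everything lands in one column means "everything predicted as that
# class" regardless of the truth.
cv = LeaveOneLabelOut(labels)
for train, test in cv:
    pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
    pred = pipe.fit(X[train], y[train]).predict(X[test])
    print confusion_matrix(y[test], pred)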
>>
>> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
>> <[email protected]> wrote:
>>> ok
>>>
>>> some more suggestions:
>>>
>>> - do you observe the same behavior with SVC, which uses a different
>>>   multiclass strategy?
>>> - what do you see when you inspect results obtained with binary
>>>   predictions (keeping 2 classes at a time)?
>>>
>>> Alex
>>>
>>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>>> Hi Alex,
>>>>
>>>> No, each subject has four runs, so I'm doing leave-one-run-out cross
>>>> validation in the original case. I'm estimating separate models within
>>>> each subject (as is common in fMRI), so all my example code here would
>>>> be inside a "for subject in subjects:" loop, but this pattern of
>>>> weirdness is happening in every subject I've looked at so far.
>>>>
>>>> Michael
>>>>
>>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>>> <[email protected]> wrote:
>>>>> hi,
>>>>>
>>>>> just a thought. You seem to be doing inter-subject prediction. In that
>>>>> case a 5-fold split mixes subjects. My hunch is that you may have a
>>>>> subject effect that acts as a confound.
>>>>>
>>>>> again, just a thought -- I read the email quickly
>>>>>
>>>>> Alex
>>>>>
>>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>>>> list, but figured one at a time :)
>>>>>>
>>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>>> metaclassifier with similar "weird" results.
>>>>>>
>>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>>> herring. For this dataset we also have a two-class condition (you can
>>>>>> think of the paradigm as a 3x2 design, although we're analyzing the
>>>>>> factors separately), and the same thing happens there:
>>>>>>
>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>> for train, test in cv:
>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>
>>>>>> 0.496377606851
>>>>>> [ 0 68]
>>>>>> [ 0 70]
>>>>>> [ 0 67]
>>>>>> [ 0 69]
>>>>>>
>>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>> for train, test in cv:
>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>
>>>>>> 0.532455733754
>>>>>> [40 28]
>>>>>> [36 34]
>>>>>> [33 34]
>>>>>> [31 38]
>>>>>>
>>>>>> Best,
>>>>>> Michael
>>>>>>
>>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko
>>>>>> <[email protected]> wrote:
>>>>>>> just to educate myself -- how does sklearn make multiclass decisions
>>>>>>> in this case? if it is all-pairs classification + voting, then the
>>>>>>> answer is simple -- ties, and the "first one in order" would take
>>>>>>> all those.
>>>>>>>
>>>>>>> but if there are no ties involved, then theoretically (not sure if
>>>>>>> this applies to your data) it is easy to come up with non-linear
>>>>>>> scenarios for binary classification where one class would be better
>>>>>>> classified than the other by a linear classifier... e.g. here is an
>>>>>>> example (sorry -- pymvpa) with an embedded normal (i.e. both classes
>>>>>>> have their mean at the same spot but significantly different
>>>>>>> variances):
>>>>>>>
>>>>>>> from mvpa2.suite import *
>>>>>>> ns, nf = 100, 10
>>>>>>> ds = dataset_wizard(
>>>>>>>     np.vstack((
>>>>>>>         np.random.normal(size=(ns, nf)),
>>>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>>>     chunks=[0, 1] * ns)
>>>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>>                      enable_ca=['stats'])
>>>>>>> cv(ds).samples
>>>>>>> print cv.ca.stats
>>>>>>>
>>>>>>> yields
>>>>>>>
>>>>>>> predictions\targets  narrow   wide    P'   N'   FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>>>>             narrow      100     74   174   26   74   0  0.57     1     1  0.26  0.43  0.39  0.41
>>>>>>>               wide        0     26    26  174    0  74     1  0.57  0.26     1     0  0.39  0.41
>>>>>>> Per target:          ------ ------
>>>>>>>   P                     100    100
>>>>>>>   N                     100    100
>>>>>>>   TP                    100     26
>>>>>>>   TN                     26    100
>>>>>>> Summary \ Means:                     100  100   37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>>>> CHI^2  123.04  p=1.7e-26
>>>>>>> ACC    0.63
>>>>>>> ACC%   63
>>>>>>> # of sets  2
>>>>>>>
>>>>>>> I bet that with a bit of creativity, classifier-dependent cases of a
>>>>>>> similar flavor could be found for linear underlying models.
>>>>>>>
>>>>>>> Cheers,
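For anyone without PyMVPA handy, here is a rough scikit-learn
translation of Yaroslav's embedded-normal demonstration. This is my
own sketch, not his code, and the exact numbers will vary with the
random draw:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Two classes with identical means but very different variances: the
# tight "narrow" class falls almost entirely on one side of whatever
# hyperplane the SVM picks, while the "wide" class straddles it, so
# per-class accuracies come out wildly asymmetric even though overall
# accuracy beats chance.
ns, nf = 100, 10
X_demo = np.vstack((np.random.normal(size=(ns, nf)),
                    10 * np.random.normal(size=(ns, nf))))
y_demo = np.array([0] * ns + [1] * ns)  # 0 = narrow, 1 = wide
half = np.tile([True, False], ns)       # interleaved split, like chunks=[0, 1] * ns

clf = SVC(kernel="linear").fit(X_demo[half], y_demo[half])
print confusion_matrix(y_demo[~half], clf.predict(X_demo[~half]))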
>>>>>>>
>>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>>
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>>>> but I'm not 100% sure I'm doing everything "right" from a
>>>>>>>> scikit-learn perspective, so I thought it best to ask here first.
>>>>>>>>
>>>>>>>> I'm doing classification of fMRI data using logistic regression.
>>>>>>>> I've been playing around with things for the past couple of days and
>>>>>>>> was getting accuracies right around or slightly above chance, which
>>>>>>>> was disappointing. Initially, my code looked a bit like this:
>>>>>>>>
>>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>>> print acc
>>>>>>>>
>>>>>>>> 0.358599857854
>>>>>>>>
>>>>>>>> Labels are an int in [1, 4] specifying which fMRI run each sample
>>>>>>>> came from, and y has three classes.
>>>>>>>>
>>>>>>>> When I went to inspect the predictions being made, though, I
>>>>>>>> realized that in each split one class was almost completely
>>>>>>>> dominating:
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> for train, test in cv:
>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>
>>>>>>>> [58  0 11]
>>>>>>>> [67  0  3]
>>>>>>>> [ 0 70  0]
>>>>>>>> [ 0 67  0]
>>>>>>>>
>>>>>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>>> predictions looks much more like what I would expect:
>>>>>>>>
>>>>>>>> cv = KFold(len(y), 5)
>>>>>>>> for train, test in cv:
>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>
>>>>>>>> [22 16 17]
>>>>>>>> [25 14 16]
>>>>>>>> [17 25 13]
>>>>>>>> [36  6 13]
>>>>>>>> [37  9 10]
>>>>>>>>
>>>>>>>> (Although note the still relative dominance of the first class.)
>>>>>>>> When I go back and run the full analysis this way, I get accuracies
>>>>>>>> more in line with what I would have expected from previous fMRI
>>>>>>>> studies in this domain.
>>>>>>>>
>>>>>>>> My design is slow event-related, so my samples should be
>>>>>>>> independent, at least as far as HRF blurring is concerned.
>>>>>>>>
>>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>>> class is not perfectly balanced, but participants are near ceiling
>>>>>>>> and thus the counts are very close:
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> for train, test in cv:
>>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>>>
>>>>>>>> [71 67 69]
>>>>>>>> [71 68 67]
>>>>>>>> [70 69 67]
>>>>>>>> [70 69 70]
>>>>>>>>
>>>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>>>>
>>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>>>
>>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>>> happening and what it means? Or suggest things I could look at in my
>>>>>>>> data/code to identify the source of the problem?
>>>>>>>>
>>>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>>>> scikit-learn an absolute delight to use!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Michael
>>>>>>>
>>>>>>> --
>>>>>>> =------------------------------------------------------------------=
>>>>>>> Keep in touch                                     www.onerussian.com
>>>>>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
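Looking back over the thread, one concrete way to test the
run-confound idea raised above: check whether run identity is itself
decodable from the data. This is my own sketch under the thread's
variable names (X, labels), not something the posters ran:

import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Scaler

# If run identity is decodable well above chance (~25% for four runs),
# the features carry run-level structure that a leave-one-run-out
# split never sees at training time.
pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
print cross_val_score(pipe, X, labels, cv=KFold(len(labels), 5)).mean()

# Gross per-run statistics; large shifts in mean or scale across runs
# would point the same way.
for run in np.unique(labels):
    print run, X[labels == run].mean(), X[labels == run].std()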
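And if a run effect does show up, one common mitigation (my
suggestion, not something from the thread) is to standardize each run
separately before classification, so that run-specific offsets in mean
and scale are removed:

import numpy as np
from sklearn.preprocessing import Scaler

# Scale each run independently: every run becomes zero-mean and
# unit-variance on its own, which removes simple additive and
# multiplicative run effects (though not more complex ones).
X_runwise = np.empty_like(X)
for run in np.unique(labels):
    mask = labels == run
    X_runwise[mask] = Scaler().fit_transform(X[mask])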
