It looks like you fit the PCA on class-specific data. You cannot expect that to yield a meaningful organization when pooling across folds. You probably want to fit the PCA on the whole dataset -- or did I miss something?
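A minimal sketch of the pooled-fit approach, assuming hypothetical arrays X (n_samples x n_features) and class labels y (made-up stand-in data, modern Python 3 / scikit-learn syntax rather than the Python 2 used elsewhere in this thread):

```python
import numpy as np
from sklearn.decomposition import PCA

# made-up stand-in data: two classes pooled into one array
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 20)),
               rng.normal(1, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# fit PCA once on the whole dataset, not per class ...
pca = PCA(n_components=2).fit(X)

# ... then project all classes into the same 2D space for visualization
X_2d = pca.transform(X)
print(X_2d.shape)
```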
Bertrand

On 01/29/2012 10:38 PM, Michael Waskom wrote:
> Aha, this does indeed suggest something strange:
>
> http://web.mit.edu/mwaskom/www/pca.png
>
> I'm going to dig into this some more, but I don't really have any
> strong intuitions to guide me here, so if anything pops out at you from
> that do feel free to speak up :)
>
> Michael
>
> On Sun, Jan 29, 2012 at 1:14 PM, Alexandre Gramfort
> <[email protected]> wrote:
>> hum...
>>
>> final suggestion: I would try to visualize a 2D or 3D PCA to see if it
>> can give me some intuition on what's happening.
>>
>> Alex
>>
>> On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom <[email protected]> wrote:
>>> Hi Alex,
>>>
>>> See my response to Yarick for some results from a binary
>>> classification. I reran both the three-way and binary classification
>>> with SVC, though, with similar results:
>>>
>>> cv = LeaveOneLabelOut(bin_labels)
>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>>>
>>> 0.496377606851
>>> [ 0 68]
>>> [ 0 70]
>>> [ 0 67]
>>> [ 0 69]
>>>
>>> cv = LeaveOneLabelOut(tri_labels)
>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>>>
>>> 0.386755821732
>>> [20  0 48]
>>> [29  1 40]
>>> [ 2  0 65]
>>> [ 0 69  0]
>>>
>>> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
>>> <[email protected]> wrote:
>>>> ok
>>>>
>>>> some more suggestions:
>>>>
>>>> - do you observe the same behavior with SVC, which uses a different
>>>>   multiclass strategy?
>>>> - what do you see when you inspect results obtained with binary
>>>>   predictions (keeping 2 classes at a time)?
>>>>
>>>> Alex
>>>>
>>>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>>>> Hi Alex,
>>>>>
>>>>> No, each subject has four runs, so I'm doing leave-one-run-out cross-
>>>>> validation in the original case. I'm estimating separate models within
>>>>> each subject (as is common in fMRI), so all my example code here would
>>>>> be from within a "for subject in subjects:" loop, but this pattern of
>>>>> weirdness is happening in every subject I've looked at so far.
>>>>>
>>>>> Michael
>>>>>
>>>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>>>> <[email protected]> wrote:
>>>>>> hi,
>>>>>>
>>>>>> just a thought: you seem to be doing inter-subject prediction. In this
>>>>>> case a 5-fold CV mixes subjects. A hint is that you may have a subject
>>>>>> effect that acts as a confound.
>>>>>>
>>>>>> again, just a thought -- I read the email quickly
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the PyMVPA
>>>>>>> list, but figured one at a time :)
>>>>>>>
>>>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>>>> metaclassifier with similar "weird" results.
>>>>>>>
>>>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>>>> herring.
>>>>>>> For this dataset we also have a two-class condition (you can
>>>>>>> think of the paradigm as a 3x2 design, although we're analyzing them
>>>>>>> separately), which has the same thing happening:
>>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>
>>>>>>> 0.496377606851
>>>>>>> [ 0 68]
>>>>>>> [ 0 70]
>>>>>>> [ 0 67]
>>>>>>> [ 0 69]
>>>>>>>
>>>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>> for train, test in cv:
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>
>>>>>>> 0.532455733754
>>>>>>> [40 28]
>>>>>>> [36 34]
>>>>>>> [33 34]
>>>>>>> [31 38]
>>>>>>>
>>>>>>> Best,
>>>>>>> Michael
>>>>>>>
>>>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko
>>>>>>> <[email protected]> wrote:
>>>>>>>> just to educate myself -- how does sklearn make multiclass decisions in
>>>>>>>> this case? if it is all-pairs classification + voting, then the answer is
>>>>>>>> simple -- ties, and the "first one in order" would take all those.
>>>>>>>>
>>>>>>>> but if there are no ties involved, then, theoretically (not sure if it is
>>>>>>>> applicable to your data), it is easy to come up with non-linear scenarios
>>>>>>>> for binary classification where one class would be better classified than
>>>>>>>> the other with a linear classifier... e.g. here is an example (sorry --
>>>>>>>> pymvpa) with an embedded normal (i.e.
>>>>>>>> both classes mean at the same spot but have
>>>>>>>> significantly different variances):
>>>>>>>>
>>>>>>>> from mvpa2.suite import *
>>>>>>>> ns, nf = 100, 10
>>>>>>>> ds = dataset_wizard(
>>>>>>>>     np.vstack((
>>>>>>>>         np.random.normal(size=(ns, nf)),
>>>>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>>>>     chunks=[0, 1] * ns)
>>>>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>>>                      enable_ca=['stats'])
>>>>>>>> cv(ds).samples
>>>>>>>> print cv.ca.stats
>>>>>>>>
>>>>>>>> yields
>>>>>>>>
>>>>>>>> predictions\targets  narrow  wide |  P'   N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC  AUC
>>>>>>>>              narrow     100    74 | 174   26  74   0 0.57    1    1 0.26 0.43 0.39 0.41
>>>>>>>>                wide       0    26 |  26  174   0  74    1 0.57 0.26    1    0 0.39 0.41
>>>>>>>> Per target:          ------ ------
>>>>>>>>                  P      100   100
>>>>>>>>                  N      100   100
>>>>>>>>                 TP      100    26
>>>>>>>>                 TN       26   100
>>>>>>>> Summary \ Means:        100   100 |  37   37      0.79 0.79 0.63 0.63 0.21 0.39 0.41
>>>>>>>> CHI^2      123.04  p=1.7e-26
>>>>>>>> ACC        0.63
>>>>>>>> ACC%       63
>>>>>>>> # of sets  2
>>>>>>>>
>>>>>>>> I bet with a bit of creativity, classifier-dependent cases of a similar
>>>>>>>> kind could be found for linear underlying models.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>>>
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>>>> learning and scikit-learn. I'm happy to kick it over to MetaOptimize,
>>>>>>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>>>>>>> perspective, so I thought it best to ask here first.
>>>>>>>>>
>>>>>>>>> I'm doing classification of fMRI data using logistic regression. I've
>>>>>>>>> been playing around with things for the past couple of days and was
>>>>>>>>> getting accuracies right around or slightly above chance, which was
>>>>>>>>> disappointing.
>>>>>>>>> Initially, my code looked a bit like this:
>>>>>>>>>
>>>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>>>> print acc
>>>>>>>>> 0.358599857854
>>>>>>>>>
>>>>>>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>>>>>>> from, and y has three classes.
>>>>>>>>>
>>>>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>>>>> that in each split one class was almost completely dominating:
>>>>>>>>>
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>
>>>>>>>>> [58  0 11]
>>>>>>>>> [67  0  3]
>>>>>>>>> [ 0 70  0]
>>>>>>>>> [ 0 67  0]
>>>>>>>>>
>>>>>>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>>>> predictions looks much more like what I would expect:
>>>>>>>>>
>>>>>>>>> cv = KFold(len(y), 5)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>
>>>>>>>>> [22 16 17]
>>>>>>>>> [25 14 16]
>>>>>>>>> [17 25 13]
>>>>>>>>> [36  6 13]
>>>>>>>>> [37  9 10]
>>>>>>>>>
>>>>>>>>> (Although note the still-relative dominance of the first class.) When
>>>>>>>>> I go back and run the full analysis this way, I get accuracies more in
>>>>>>>>> line with what I would have expected from previous fMRI studies in
>>>>>>>>> this domain.
>>>>>>>>>
>>>>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>>>>> at least as far as HRF blurring is concerned.
>>>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>>>> class is not perfectly balanced, but participants are near ceiling and
>>>>>>>>> thus the class counts are very close:
>>>>>>>>>
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>>>>
>>>>>>>>> [71 67 69]
>>>>>>>>> [71 68 67]
>>>>>>>>> [70 69 67]
>>>>>>>>> [70 69 70]
>>>>>>>>>
>>>>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>>>>>
>>>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>>>>    happening and what it means? Or suggest things I could look at in
>>>>>>>>>    my data/code to identify the source of the problem?
>>>>>>>>>
>>>>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>>>>> scikit-learn an absolute delight to use!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Michael
>>>>>>>>
>>>>>>>> --
>>>>>>>> =------------------------------------------------------------------=
>>>>>>>> Keep in touch                                  www.onerussian.com
>>>>>>>> Yaroslav Halchenko                www.ohloh.net/accounts/yarikoptic
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Try before you buy = See our experts in action!
>>>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>>>> _______________________________________________
>>>>>>>> Scikit-learn-general mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
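Alex's earlier suggestion of inspecting binary predictions two classes at a time could be sketched like this -- a hypothetical illustration with made-up stand-ins for X, y, and the run labels, using the modern scikit-learn names StandardScaler and LeaveOneGroupOut in place of the Scaler and LeaveOneLabelOut of this thread's era:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# made-up stand-ins for Michael's data: 3 classes, 4 runs
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 30))
y = rng.randint(0, 3, size=120)
runs = np.repeat([1, 2, 3, 4], 30)   # run label for each sample

# score each pair of classes separately with leave-one-run-out CV
for a, b in combinations(np.unique(y), 2):
    mask = np.isin(y, [a, b])
    pipe = Pipeline([("scale", StandardScaler()),
                     ("classify", LogisticRegression())])
    scores = cross_val_score(pipe, X[mask], y[mask],
                             groups=runs[mask], cv=LeaveOneGroupOut())
    print(a, "vs", b, round(scores.mean(), 2))
```

With pure-noise data like this, each pairwise accuracy should hover around chance; on real data, a pair that consistently collapses onto one class would localize the problem.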
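For readers without PyMVPA, Yaroslav's embedded-normal example above can be approximated with scikit-learn alone. This is a sketch under the same narrow/wide setup, in modern Python 3 / scikit-learn syntax; the exact per-class split will vary with the random seed:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.RandomState(42)
ns, nf = 100, 10

# two classes with the same mean but very different variances
X = np.vstack([rng.normal(size=(ns, nf)),        # "narrow" class
               10 * rng.normal(size=(ns, nf))])  # "wide" class
y = np.array(["narrow"] * ns + ["wide"] * ns)

# cross-validated predictions from a linear SVM
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=cv)

# compare per-class accuracies -- they can be strongly asymmetric even
# though no linear boundary separates these two classes
for cls in ("narrow", "wide"):
    print(cls, np.mean(pred[y == cls] == cls))
```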
