Sorry, probably not clear from that snippet, but the labels vector
corresponds to run (and is the id I'm using for the leave-one-label-out
CV strategy that's giving me problems).  My (perhaps naive) assumption
would be that the dataset should be distributed more or less evenly
across these splits, but the fact that the runs seem separable in this
PCA space suggests to me that something is wrong with the code that
extracts the dataset from my images.
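One way to sanity-check this is to fit the PCA on the whole dataset and then ask how well the run label itself can be predicted from the leading components. This is a sketch I'd try, not code from the thread; the data, run labels, and per-run offset here are simulated stand-ins for the real fMRI matrix:

```python
# Hedged sketch: if "run" is decodable from the top PCs, a run-level
# confound exists and leave-one-run-out folds will suffer from it.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_per_run, n_runs, n_feat = 70, 4, 50

# Simulate a per-run additive offset (e.g. a session/scanner confound).
offsets = rng.normal(scale=2.0, size=(n_runs, n_feat))
X = np.vstack([rng.normal(size=(n_per_run, n_feat)) + offsets[r]
               for r in range(n_runs)])
runs = np.repeat(np.arange(n_runs), n_per_run)

# Fit PCA on the *whole* dataset, then try to predict the run label
# from the two leading components.
Z = PCA(n_components=2).fit_transform(X)
acc = cross_val_score(LogisticRegression(max_iter=1000), Z, runs, cv=5).mean()
print("run-decoding accuracy from 2 PCs: %.3f" % acc)
```

With four runs, chance level for run decoding is 0.25; an accuracy well above that would confirm a run effect in the data.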

On Sun, Jan 29, 2012 at 1:48 PM, bthirion <[email protected]> wrote:
> It looks like you fit the PCA on class-specific data. You cannot expect
> that this will yield a meaningful organization when pooling across
> folds. You probably want to train the PCA on the whole dataset, or did I
> miss something?
>
> Bertrand
>
> On 01/29/2012 10:38 PM, Michael Waskom wrote:
>> Aha, this does indeed suggest something strange:
>>
>> http://web.mit.edu/mwaskom/www/pca.png
>>
>> I'm going to dig into this some more, but I don't really have any
>> strong intuitions to guide me here so if anything pops out at you from
>> that do feel free to speak up :)
>>
>> Michael
>>
>> On Sun, Jan 29, 2012 at 1:14 PM, Alexandre Gramfort
>> <[email protected]>  wrote:
>>> hum...
>>>
>>> final suggestion: I would try to visualize a 2D or 3D PCA to see if it
>>> can give me some intuition on what's happening.
>>>
>>> Alex
>>>
>>> On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom<[email protected]>  
>>> wrote:
>>>> Hi Alex,
>>>>
>>>> See my response to Yarick for some results from a binary
>>>> classification.  I reran both the three-way and binary classification
>>>> with SVC, though, with similar results:
>>>>
>>>> cv = LeaveOneLabelOut(bin_labels)
>>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>>>>
>>>> 0.496377606851
>>>> [ 0 68]
>>>> [ 0 70]
>>>> [ 0 67]
>>>> [ 0 69]
>>>>
>>>> cv = LeaveOneLabelOut(tri_labels)
>>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>>>> for train, test in cv:
>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>>>>
>>>> 0.386755821732
>>>> [20  0 48]
>>>> [29  1 40]
>>>> [ 2  0 65]
>>>> [ 0 69  0]
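For what it's worth, degenerate per-fold histograms like the ones above can be reproduced with simulated data in which each run carries its own strong additive offset and the class signal is weak. This is only an illustrative sketch with made-up names, written against the modern scikit-learn API (StandardScaler, LeaveOneGroupOut), not the original analysis code:

```python
# Hedged sketch: a per-run offset acts as a confound, so each held-out
# run lands on one side of the learned boundary and a single class
# dominates the fold's predictions.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(42)
n_per_run, n_runs, n_feat = 68, 4, 30

y = np.tile(np.repeat([0, 1], n_per_run // 2), n_runs)   # balanced classes
runs = np.repeat(np.arange(n_runs), n_per_run)

# Weak class signal, strong per-run offset.
class_means = rng.normal(scale=0.1, size=(2, n_feat))
run_offsets = rng.normal(scale=3.0, size=(n_runs, n_feat))
X = class_means[y] + run_offsets[runs] + rng.normal(size=(len(y), n_feat))

pipe = Pipeline([("scale", StandardScaler()),
                 ("classify", SVC(kernel="linear"))])
cv = LeaveOneGroupOut()

for train, test in cv.split(X, y, groups=runs):
    pred = pipe.fit(X[train], y[train]).predict(X[test])
    print(np.bincount(pred, minlength=2))   # typically lopsided

scores = cross_val_score(pipe, X, y, groups=runs, cv=cv)
print(scores.mean())
```

With balanced classes, a fold whose predictions collapse onto one class scores exactly 0.5, which matches the near-chance means reported above.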
>>>>
>>>> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
>>>> <[email protected]>  wrote:
>>>>> ok
>>>>>
>>>>> some more suggestions:
>>>>>
>>>>> - do you observe the same behavior with SVC which uses a different
>>>>> multiclass strategy?
>>>>> - what do you see when you inspect results obtained with binary
>>>>> predictions (keeping 2 classes at a time)?
>>>>>
>>>>> Alex
>>>>>
>>>>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom<[email protected]>  
>>>>> wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> No, each subject has four runs, so I'm doing leave-one-run-out
>>>>>> cross-validation in the original case. I'm estimating separate models
>>>>>> within each subject (as is common in fMRI), so all my example code here
>>>>>> would be inside a "for subject in subjects:" loop, but this pattern of
>>>>>> weirdness is happening in every subject I've looked at so far.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>>>>> <[email protected]>  wrote:
>>>>>>> hi,
>>>>>>>
>>>>>>> just a thought. You seem to be doing inter-subject prediction. In this
>>>>>>> case a 5-fold CV mixes subjects. My hunch is that you may have a
>>>>>>> subject effect that acts as a confound.
>>>>>>>
>>>>>>> again, just a thought; I read the email quickly
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom<[email protected]>  
>>>>>>> wrote:
>>>>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the pymvpa
>>>>>>>> list, but figured one at a time :)
>>>>>>>>
>>>>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>>>>> metaclassifier with similar "weird" results.
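In scikit-learn the two multiclass strategies can also be compared explicitly with the wrappers in sklearn.multiclass. A minimal sketch on the built-in iris data (illustrative only, not the fMRI dataset from the thread):

```python
# Hedged sketch: wrap the same base estimator in one-vs-rest and
# one-vs-one meta-classifiers and compare cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
base = LogisticRegression(max_iter=1000)

accs = {}
for name, clf in [("one-vs-rest", OneVsRestClassifier(base)),
                  ("one-vs-one", OneVsOneClassifier(base))]:
    accs[name] = cross_val_score(clf, X_iris, y_iris, cv=5).mean()
    print("%s: %.3f" % (name, accs[name]))
```

If both wrappers show the same "weird" behavior on the real data, the multiclass reduction is unlikely to be the culprit, which is consistent with the binary results below.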
>>>>>>>>
>>>>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>>>>> herring.  For this dataset we also have a two-class condition (you can
>>>>>>>> think of the paradigm as a 3x2 design, although we're analyzing them
>>>>>>>> separately), which has the same thing happening:
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>>> for train, test in cv:
>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>>
>>>>>>>> 0.496377606851
>>>>>>>> [ 0 68]
>>>>>>>> [ 0 70]
>>>>>>>> [ 0 67]
>>>>>>>> [ 0 69]
>>>>>>>>
>>>>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>>> for train, test in cv:
>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>>
>>>>>>>> 0.532455733754
>>>>>>>> [40 28]
>>>>>>>> [36 34]
>>>>>>>> [33 34]
>>>>>>>> [31 38]
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav 
>>>>>>>> Halchenko<[email protected]>  wrote:
>>>>>>>>> just to educate myself -- how does sklearn do multiclass decisions in
>>>>>>>>> this case?  if it is all-pairs classification + voting, then the
>>>>>>>>> answer is simple -- ties, and the "first one in order" would take all
>>>>>>>>> those.
>>>>>>>>>
>>>>>>>>> but if there are no ties involved, then theoretically (not sure if it
>>>>>>>>> is applicable to your data) it is easy to come up with non-linear
>>>>>>>>> scenarios for binary classification where one class would be better
>>>>>>>>> classified than the other with a linear classifier...  e.g. here is an
>>>>>>>>> example (sorry -- pymvpa) with an embedded normal (i.e. both classes
>>>>>>>>> have their mean at the same spot but significantly different
>>>>>>>>> variances)
>>>>>>>>>
>>>>>>>>>     from mvpa2.suite import *
>>>>>>>>>     ns, nf = 100, 10
>>>>>>>>>     ds = dataset_wizard(
>>>>>>>>>         np.vstack((
>>>>>>>>>             np.random.normal(size=(ns, nf)),
>>>>>>>>>             10*np.random.normal(size=(ns, nf)))),
>>>>>>>>>         targets=['narrow']*ns + ['wide']*ns,
>>>>>>>>>         chunks=[0,1]*ns)
>>>>>>>>>     cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>>>>                          enable_ca=['stats'])
>>>>>>>>>     cv(ds).samples
>>>>>>>>>     print cv.ca.stats
>>>>>>>>>
>>>>>>>>> yields
>>>>>>>>>
>>>>>>>>>     ----------.
>>>>>>>>>     predictions\targets  narrow   wide
>>>>>>>>>                 `------  ------  ------   P'   N'  FP  FN   PPV   NPV   TPR   SPC   FDR   MCC   AUC
>>>>>>>>>            narrow         100      74    174   26  74   0  0.57     1     1  0.26  0.43  0.39  0.41
>>>>>>>>>             wide           0       26     26  174   0  74     1  0.57  0.26     1     0  0.39  0.41
>>>>>>>>>     Per target:          ------  ------
>>>>>>>>>              P            100     100
>>>>>>>>>              N            100     100
>>>>>>>>>              TP           100      26
>>>>>>>>>              TN            26     100
>>>>>>>>>     Summary \ Means:     ------  ------  100  100  37  37  0.79  0.79  0.63  0.63  0.21  0.39  0.41
>>>>>>>>>            CHI^2         123.04 p=1.7e-26
>>>>>>>>>             ACC           0.63
>>>>>>>>>             ACC%           63
>>>>>>>>>          # of sets         2
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I bet with a bit of creativity, classifier-dependent cases of a
>>>>>>>>> similar kind could be found for linear underlying models.
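Yaroslav's pymvpa example translates roughly to scikit-learn as follows (my sketch, not his code; a plain linear SVC with stratified 2-fold CV stands in for the chunk-based partitioner):

```python
# Hedged sketch of the embedded-normal scenario: two classes share a
# mean but differ in variance, and a linear SVM classifies one class
# much better than the other.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.RandomState(0)
ns, nf = 100, 10
X = np.vstack([rng.normal(size=(ns, nf)),         # "narrow" class
               10 * rng.normal(size=(ns, nf))])   # "wide" class
y = np.array([0] * ns + [1] * ns)                 # 0 = narrow, 1 = wide

# Stratified 2-fold cross-validated predictions with a linear SVM.
pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=2)
cm = confusion_matrix(y, pred)
print(cm)
```

The per-class recalls (the diagonal of the confusion matrix) come out strongly asymmetric, mirroring the pymvpa stats above, though the exact numbers depend on the seed and the SVM's C.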
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>>>>
>>>>>>>>>> Hi Folks,
>>>>>>>>>>
>>>>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>>>>> learning and scikit-learn. I'm happy to kick it over to metaoptimize,
>>>>>>>>>> but I'm not 100% sure I'm doing everything "right" from a
>>>>>>>>>> scikit-learn perspective, so I thought it best to ask here first.
>>>>>>>>>>
>>>>>>>>>> I'm doing classification of fMRI data using logistic regression.
>>>>>>>>>> I've been playing around with things for the past couple of days and
>>>>>>>>>> was getting accuracies right around or slightly above chance, which
>>>>>>>>>> was disappointing.
>>>>>>>>>>
>>>>>>>>>> Initially, my code looked a bit like this:
>>>>>>>>>>
>>>>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>>>>> print acc
>>>>>>>>>> 0.358599857854
>>>>>>>>>>
>>>>>>>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>>>>>>>> from, and y has three classes.
>>>>>>>>>>
>>>>>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>>>>>> that in each split one class was almost completely dominating:
>>>>>>>>>>
>>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>>> for train, test in cv:
>>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>>
>>>>>>>>>> [58  0 11]
>>>>>>>>>> [67  0  3]
>>>>>>>>>> [ 0 70  0]
>>>>>>>>>> [ 0 67  0]
>>>>>>>>>>
>>>>>>>>>> Which doesn't seem right at all.  I realized that if I disregard the
>>>>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>>>>> predictions looks much more like what I would expect:
>>>>>>>>>>
>>>>>>>>>> cv = KFold(len(y), 5)
>>>>>>>>>> for train, test in cv:
>>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>>
>>>>>>>>>> [22 16 17]
>>>>>>>>>> [25 14 16]
>>>>>>>>>> [17 25 13]
>>>>>>>>>> [36  6 13]
>>>>>>>>>> [37  9 10]
>>>>>>>>>>
>>>>>>>>>> (Although note the still-relative dominance of the first class.)
>>>>>>>>>> When I go back and run the full analysis this way, I get accuracies
>>>>>>>>>> more in line with what I would have expected from previous fMRI
>>>>>>>>>> studies in this domain.
>>>>>>>>>>
>>>>>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>>>>>> at least as far as HRF blurring is concerned.
>>>>>>>>>>
>>>>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>>>>> class is not perfectly balanced, but participants are near ceiling
>>>>>>>>>> and thus the counts are very close:
>>>>>>>>>>
>>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>>> for train, test in cv:
>>>>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>>>>>
>>>>>>>>>> [71 67 69]
>>>>>>>>>> [71 68 67]
>>>>>>>>>> [70 69 67]
>>>>>>>>>> [70 69 70]
>>>>>>>>>
>>>>>>>>>> Apologies for the long explanation.  Two questions, really:
>>>>>>>>>>
>>>>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>>>>> happening and what it means? Or suggest things I could look at in my
>>>>>>>>>> data/code to identify the source of the problem?
>>>>>>>>>>
>>>>>>>>>> I really appreciate it!  Aside from this befuddling issue, I've found
>>>>>>>>>> scikit-learn an absolute delight to use!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Michael
>>>>>>>>> --
>>>>>>>>> =------------------------------------------------------------------=
>>>>>>>>> Keep in touch                                     www.onerussian.com
>>>>>>>>> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> Try before you buy = See our experts in action!
>>>>>>>>> The most comprehensive online learning library for Microsoft 
>>>>>>>>> developers
>>>>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, 
>>>>>>>>> MVC3,
>>>>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>>>>> _______________________________________________
>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Try before you buy = See our experts in action!
>>>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, 
>>>>>>>> MVC3,
>>>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>>>> _______________________________________________
>>>>>>>> Scikit-learn-general mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Try before you buy = See our experts in action!
>>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>>> _______________________________________________
>>>>>>> Scikit-learn-general mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>> ------------------------------------------------------------------------------
>>>>>> Try before you buy = See our experts in action!
>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> ------------------------------------------------------------------------------
>>>>> Try before you buy = See our experts in action!
>>>>> The most comprehensive online learning library for Microsoft developers
>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>> ------------------------------------------------------------------------------
>>>> Try before you buy = See our experts in action!
>>>> The most comprehensive online learning library for Microsoft developers
>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>> ------------------------------------------------------------------------------
>>> Try before you buy = See our experts in action!
>>> The most comprehensive online learning library for Microsoft developers
>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>> http://p.sf.net/sfu/learndevnow-dev2
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> ------------------------------------------------------------------------------
>> Try before you buy = See our experts in action!
>> The most comprehensive online learning library for Microsoft developers
>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>> Metro Style Apps, more. Free future releases when you subscribe now!
>> http://p.sf.net/sfu/learndevnow-dev2
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
