It looks like you fit the PCA on class-specific data. You cannot expect that to yield a meaningful organization when pooling across folds. You probably want to fit the PCA on the whole dataset -- or did I miss something?
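A minimal sketch of the pooled-fit approach, assuming hypothetical arrays X (n_samples x n_features) and class labels y (made-up stand-in data, modern Python 3 / scikit-learn syntax rather than the Python 2 used elsewhere in this thread):

```python
import numpy as np
from sklearn.decomposition import PCA

# made-up stand-in data: two classes pooled into one array
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 20)),
               rng.normal(1, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# fit PCA once on the whole dataset, not per class ...
pca = PCA(n_components=2).fit(X)

# ... then project all classes into the same 2D space for visualization
X_2d = pca.transform(X)
print(X_2d.shape)
```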
Bertrand

On 01/29/2012 10:38 PM, Michael Waskom wrote:
> Aha, this does indeed suggest something strange:
>
> http://web.mit.edu/mwaskom/www/pca.png
>
> I'm going to dig into this some more, but I don't really have any
> strong intuitions to guide me here, so if anything pops out at you from
> that do feel free to speak up :)
>
> Michael
>
> On Sun, Jan 29, 2012 at 1:14 PM, Alexandre Gramfort
> <[email protected]> wrote:
>> hum...
>>
>> final suggestion: I would try to visualize a 2D or 3D PCA to see if it
>> can give me some intuition on what's happening.
>>
>> Alex
>>
>> On Sun, Jan 29, 2012 at 9:58 PM, Michael Waskom <[email protected]> wrote:
>>> Hi Alex,
>>>
>>> See my response to Yarick for some results from a binary
>>> classification. I reran both the three-way and binary classification
>>> with SVC, though, with similar results:
>>>
>>> cv = LeaveOneLabelOut(bin_labels)
>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>> print cross_val_score(pipe, bin_X, bin_y, cv=cv).mean()
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>     print histogram(pipe.fit(bin_X[train], bin_y[train]).predict(bin_X[test]), 2)[0]
>>>
>>> 0.496377606851
>>> [ 0 68]
>>> [ 0 70]
>>> [ 0 67]
>>> [ 0 69]
>>>
>>> cv = LeaveOneLabelOut(tri_labels)
>>> pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>> print cross_val_score(pipe, tri_X, tri_y, cv=cv).mean()
>>> for train, test in cv:
>>>     pipe = Pipeline([("scale", Scaler()), ("classify", SVC(kernel="linear"))])
>>>     print histogram(pipe.fit(tri_X[train], tri_y[train]).predict(tri_X[test]), 3)[0]
>>>
>>> 0.386755821732
>>> [20  0 48]
>>> [29  1 40]
>>> [ 2  0 65]
>>> [ 0 69  0]
>>>
>>> On Sun, Jan 29, 2012 at 12:38 PM, Alexandre Gramfort
>>> <[email protected]> wrote:
>>>> ok
>>>>
>>>> some more suggestions:
>>>>
>>>> - do you observe the same behavior with SVC, which uses a different
>>>>   multiclass strategy?
>>>> - what do you see when you inspect results obtained with binary
>>>>   predictions (keeping 2 classes at a time)?
>>>>
>>>> Alex
>>>>
>>>> On Sun, Jan 29, 2012 at 4:59 PM, Michael Waskom <[email protected]> wrote:
>>>>> Hi Alex,
>>>>>
>>>>> No, each subject has four runs, so I'm doing leave-one-run-out cross-
>>>>> validation in the original case. I'm estimating separate models within
>>>>> each subject (as is common in fMRI), so all my example code here would
>>>>> be from within a "for subject in subjects:" loop, but this pattern of
>>>>> weirdness is happening in every subject I've looked at so far.
>>>>>
>>>>> Michael
>>>>>
>>>>> On Sun, Jan 29, 2012 at 5:34 AM, Alexandre Gramfort
>>>>> <[email protected]> wrote:
>>>>>> hi,
>>>>>>
>>>>>> just a thought: you seem to be doing inter-subject prediction. In this
>>>>>> case a 5-fold CV mixes subjects. A hint is that you may have a subject
>>>>>> effect that acts as a confound.
>>>>>>
>>>>>> again, just a thought -- I read the email quickly
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Sun, Jan 29, 2012 at 5:39 AM, Michael Waskom <[email protected]> wrote:
>>>>>>> Hi Yarick, thanks for chiming in! I thought about spamming the PyMVPA
>>>>>>> list, but figured one at a time :)
>>>>>>>
>>>>>>> The scikit-learn LogisticRegression class uses one-vs-all in a
>>>>>>> multiclass setting, although I also tried it with their one-vs-one
>>>>>>> metaclassifier with similar "weird" results.
>>>>>>>
>>>>>>> Interestingly, though, I think the multiclass setting is a red
>>>>>>> herring.
>>>>>>> For this dataset we also have a two-class condition (you can
>>>>>>> think of the paradigm as a 3x2 design, although we're analyzing them
>>>>>>> separately), which has the same thing happening:
>>>>>>>
>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>> for train, test in cv:
>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>
>>>>>>> 0.496377606851
>>>>>>> [ 0 68]
>>>>>>> [ 0 70]
>>>>>>> [ 0 67]
>>>>>>> [ 0 69]
>>>>>>>
>>>>>>> cv = LeaveOneLabelOut(np.random.permutation(labels))
>>>>>>> pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>> print cross_val_score(pipe, X, y, cv=cv).mean()
>>>>>>> for train, test in cv:
>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 2)[0]
>>>>>>>
>>>>>>> 0.532455733754
>>>>>>> [40 28]
>>>>>>> [36 34]
>>>>>>> [33 34]
>>>>>>> [31 38]
>>>>>>>
>>>>>>> Best,
>>>>>>> Michael
>>>>>>>
>>>>>>> On Sat, Jan 28, 2012 at 6:09 PM, Yaroslav Halchenko
>>>>>>> <[email protected]> wrote:
>>>>>>>> just to educate myself -- how does sklearn make multiclass decisions in
>>>>>>>> this case? if it is all-pairs classification + voting, then the answer is
>>>>>>>> simple -- ties, and the "first one in order" would take all those.
>>>>>>>>
>>>>>>>> but if there are no ties involved, then, theoretically (not sure if it is
>>>>>>>> applicable to your data), it is easy to come up with non-linear scenarios
>>>>>>>> for binary classification where one class would be better classified than
>>>>>>>> the other with a linear classifier... e.g. here is an example (sorry --
>>>>>>>> pymvpa) with an embedded normal (i.e.
>>>>>>>> both classes mean at the same spot but have
>>>>>>>> significantly different variances):
>>>>>>>>
>>>>>>>> from mvpa2.suite import *
>>>>>>>> ns, nf = 100, 10
>>>>>>>> ds = dataset_wizard(
>>>>>>>>     np.vstack((
>>>>>>>>         np.random.normal(size=(ns, nf)),
>>>>>>>>         10 * np.random.normal(size=(ns, nf)))),
>>>>>>>>     targets=['narrow'] * ns + ['wide'] * ns,
>>>>>>>>     chunks=[0, 1] * ns)
>>>>>>>> cv = CrossValidation(LinearCSVMC(), NFoldPartitioner(),
>>>>>>>>                      enable_ca=['stats'])
>>>>>>>> cv(ds).samples
>>>>>>>> print cv.ca.stats
>>>>>>>>
>>>>>>>> yields
>>>>>>>>
>>>>>>>> predictions\targets  narrow  wide |  P'   N'  FP  FN  PPV  NPV  TPR  SPC  FDR  MCC  AUC
>>>>>>>>              narrow     100    74 | 174   26  74   0 0.57    1    1 0.26 0.43 0.39 0.41
>>>>>>>>                wide       0    26 |  26  174   0  74    1 0.57 0.26    1    0 0.39 0.41
>>>>>>>> Per target:          ------ ------
>>>>>>>>                  P      100   100
>>>>>>>>                  N      100   100
>>>>>>>>                 TP      100    26
>>>>>>>>                 TN       26   100
>>>>>>>> Summary \ Means:        100   100 |  37   37      0.79 0.79 0.63 0.63 0.21 0.39 0.41
>>>>>>>> CHI^2      123.04  p=1.7e-26
>>>>>>>> ACC        0.63
>>>>>>>> ACC%       63
>>>>>>>> # of sets  2
>>>>>>>>
>>>>>>>> I bet with a bit of creativity, classifier-dependent cases of a similar
>>>>>>>> kind could be found for linear underlying models.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> On Sat, 28 Jan 2012, Michael Waskom wrote:
>>>>>>>>
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> I hope you don't mind a question that's a mix of general machine
>>>>>>>>> learning and scikit-learn. I'm happy to kick it over to MetaOptimize,
>>>>>>>>> but I'm not 100% sure I'm doing everything "right" from a scikit-learn
>>>>>>>>> perspective, so I thought it best to ask here first.
>>>>>>>>>
>>>>>>>>> I'm doing classification of fMRI data using logistic regression. I've
>>>>>>>>> been playing around with things for the past couple of days and was
>>>>>>>>> getting accuracies right around or slightly above chance, which was
>>>>>>>>> disappointing.
>>>>>>>>> Initially, my code looked a bit like this:
>>>>>>>>>
>>>>>>>>> pipeline = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> acc = cross_val_score(pipeline, X, y, cv=cv).mean()
>>>>>>>>> print acc
>>>>>>>>> 0.358599857854
>>>>>>>>>
>>>>>>>>> Labels are ints in [1, 4] specifying which fMRI run each sample came
>>>>>>>>> from, and y has three classes.
>>>>>>>>>
>>>>>>>>> When I went to inspect the predictions being made, though, I realized
>>>>>>>>> that in each split one class was almost completely dominating:
>>>>>>>>>
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>
>>>>>>>>> [58  0 11]
>>>>>>>>> [67  0  3]
>>>>>>>>> [ 0 70  0]
>>>>>>>>> [ 0 67  0]
>>>>>>>>>
>>>>>>>>> Which doesn't seem right at all. I realized that if I disregard the
>>>>>>>>> labels and just run 5-fold cross-validation, though, the balance of
>>>>>>>>> predictions looks much more like what I would expect:
>>>>>>>>>
>>>>>>>>> cv = KFold(len(y), 5)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     pipe = Pipeline([("scale", Scaler()), ("classify", LogisticRegression())])
>>>>>>>>>     print histogram(pipe.fit(X[train], y[train]).predict(X[test]), 3)[0]
>>>>>>>>>
>>>>>>>>> [22 16 17]
>>>>>>>>> [25 14 16]
>>>>>>>>> [17 25 13]
>>>>>>>>> [36  6 13]
>>>>>>>>> [37  9 10]
>>>>>>>>>
>>>>>>>>> (Although note the still-relative dominance of the first class.) When
>>>>>>>>> I go back and run the full analysis this way, I get accuracies more in
>>>>>>>>> line with what I would have expected from previous fMRI studies in
>>>>>>>>> this domain.
>>>>>>>>>
>>>>>>>>> My design is slow event-related, so my samples should be independent,
>>>>>>>>> at least as far as HRF blurring is concerned.
>>>>>>>>> I'm not considering error trials, so the number of samples for each
>>>>>>>>> class is not perfectly balanced, but participants are near ceiling and
>>>>>>>>> thus the class counts are very close:
>>>>>>>>>
>>>>>>>>> cv = LeaveOneLabelOut(labels)
>>>>>>>>> for train, test in cv:
>>>>>>>>>     print histogram(y[train], 3)[0]
>>>>>>>>>
>>>>>>>>> [71 67 69]
>>>>>>>>> [71 68 67]
>>>>>>>>> [70 69 67]
>>>>>>>>> [70 69 70]
>>>>>>>>>
>>>>>>>>> Apologies for the long explanation. Two questions, really:
>>>>>>>>>
>>>>>>>>> 1) Does it look like I'm doing anything obviously wrong?
>>>>>>>>> 2) If not, can you help me build some intuition about why this is
>>>>>>>>>    happening and what it means? Or suggest things I could look at in
>>>>>>>>>    my data/code to identify the source of the problem?
>>>>>>>>>
>>>>>>>>> I really appreciate it! Aside from this befuddling issue, I've found
>>>>>>>>> scikit-learn an absolute delight to use!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Michael
>>>>>>>>
>>>>>>>> --
>>>>>>>> =------------------------------------------------------------------=
>>>>>>>> Keep in touch                                  www.onerussian.com
>>>>>>>> Yaroslav Halchenko                www.ohloh.net/accounts/yarikoptic
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Try before you buy = See our experts in action!
>>>>>>>> The most comprehensive online learning library for Microsoft developers
>>>>>>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>>>>>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>>>>>>> http://p.sf.net/sfu/learndevnow-dev2
>>>>>>>> _______________________________________________
>>>>>>>> Scikit-learn-general mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
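Alex's earlier suggestion of inspecting binary predictions two classes at a time could be sketched like this -- a hypothetical illustration with made-up stand-ins for X, y, and the run labels, using the modern scikit-learn names StandardScaler and LeaveOneGroupOut in place of the Scaler and LeaveOneLabelOut of this thread's era:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# made-up stand-ins for Michael's data: 3 classes, 4 runs
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 30))
y = rng.randint(0, 3, size=120)
runs = np.repeat([1, 2, 3, 4], 30)   # run label for each sample

# score each pair of classes separately with leave-one-run-out CV
for a, b in combinations(np.unique(y), 2):
    mask = np.isin(y, [a, b])
    pipe = Pipeline([("scale", StandardScaler()),
                     ("classify", LogisticRegression())])
    scores = cross_val_score(pipe, X[mask], y[mask],
                             groups=runs[mask], cv=LeaveOneGroupOut())
    print(a, "vs", b, round(scores.mean(), 2))
```

With pure-noise data like this, each pairwise accuracy should hover around chance; on real data, a pair that consistently collapses onto one class would localize the problem.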
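For readers without PyMVPA, Yaroslav's embedded-normal example above can be approximated with scikit-learn alone. This is a sketch under the same narrow/wide setup, in modern Python 3 / scikit-learn syntax; the exact per-class split will vary with the random seed:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.RandomState(42)
ns, nf = 100, 10

# two classes with the same mean but very different variances
X = np.vstack([rng.normal(size=(ns, nf)),        # "narrow" class
               10 * rng.normal(size=(ns, nf))])  # "wide" class
y = np.array(["narrow"] * ns + ["wide"] * ns)

# cross-validated predictions from a linear SVM
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=cv)

# compare per-class accuracies -- they can be strongly asymmetric even
# though no linear boundary separates these two classes
for cls in ("narrow", "wide"):
    print(cls, np.mean(pred[y == cls] == cls))
```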
