I'd phrase that as "breaking the run structure is **SOMETIMES** ok for
classification" ...
You should definitely check the partitioning schemes for order effects
and be as conservative as possible. For example, you could ensure that
samples collected 2 TRs apart are not split between the training and
testing sets. If there are signs that your results are very sensitive
to the partitioning scheme (e.g., vastly different accuracies when
shifting the partitioning by one sample), you'll need to look more
closely at the dependencies. Basically, use extreme caution.
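A minimal sketch of the kind of adjacency check I mean, in plain Python; the function name, sample layout, and min_gap value are all invented for illustration and are not pymvpa API:

```python
# Sketch: flag cross-validation splits that put temporally adjacent
# samples (e.g., acquired fewer than min_gap TRs apart) on opposite
# sides of the train/test divide. Purely illustrative, not pymvpa API.

def adjacency_violations(acquisition_order, test_idx, min_gap=2):
    """Return (test, train) index pairs whose acquisition times are
    closer than min_gap but that ended up in different partitions."""
    test = set(test_idx)
    violations = []
    for i, t_i in enumerate(acquisition_order):
        for j, t_j in enumerate(acquisition_order):
            if i in test and j not in test and abs(t_i - t_j) < min_gap:
                violations.append((i, j))
    return violations

# Ten samples acquired one TR apart; putting samples 5-7 in the test
# set splits neighbors (4,5) and (7,8) across the partitions.
order = list(range(10))
print(adjacency_violations(order, test_idx=[5, 6, 7]))  # [(5, 4), (7, 8)]
```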
I'm not current enough on pymvpa to give any advice there ... I do
strongly suggest pre-planning the partitioning schemes (i.e., which
examples fall into which sets on which folds and replications) before
starting, both to ensure balance and to enable sensitivity testing.
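One way to do that pre-planning, as a stdlib-Python sketch: fix every sample's fold assignment up front, then verify the per-fold condition counts before touching a classifier. The data layout here is invented for illustration:

```python
# Sketch: tabulate examples per condition per fold from a pre-planned
# assignment, so imbalances are caught before any classification runs.
from collections import Counter

def fold_balance(fold_of, condition_of):
    """Count examples of each condition per fold from two parallel lists."""
    counts = {}
    for fold, cond in zip(fold_of, condition_of):
        counts.setdefault(fold, Counter())[cond] += 1
    return counts

# 12 samples, 3 folds, 2 conditions ('A'/'B'), planned to be balanced.
folds = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
conds = ['A', 'A', 'B', 'B'] * 3
for fold, c in sorted(fold_balance(folds, conds).items()):
    print(fold, dict(c))  # each fold: {'A': 2, 'B': 2}
```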
Jo
On 8/10/2012 1:53 PM, Edmund Chong wrote:
Thanks Jo!
For me, I don't have that many runs, so partitioning on groups of runs
is not really a good option. So I'd rather try doing "leave-n-samples
out" instead of "leave-n-runs out" -- I looked at your paper and indeed
it seems that breaking the run structure is ok for classification
However, do you know of any way to do that with pre-written pymvpa
functions, or do I have to manually partition the runs myself?
Thanks!
-Edmund
On Wed, Aug 8, 2012 at 3:37 PM, J.A. Etzel <[email protected]> wrote:
What you describe is one option. I talked about those types of
schemes and when they can be ok (in my opinion!) in
http://dx.doi.org/10.1016/j.neuroimage.2010.08.050
As general advice, it seems best to partition so that the number of
examples of each case in each cross-validation fold is roughly equal.
Sometimes that's just plain not possible. For example, I have a
dataset with a large number of runs, but only trials the person gets
correct are analyzed, so the number of examples in some runs varies
drastically across people. What we did in that case was to partition
on groups of runs, so that one fold leaves runs 1, 2, 3, and 4 out.
This scheme equalized the number of examples somewhat (though I still
subsetted examples to make them exactly equal) and seemed to reduce
the amount of variation.
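A plain-Python sketch of that group-of-runs idea (the run count and group size here are illustrative, and this is not pymvpa API):

```python
# Sketch: partition runs into fixed groups; fold k leaves group k's
# runs out for testing. Group size and run numbering are illustrative.

def group_runs(runs, group_size):
    """Map consecutive runs into groups of group_size; one fold per group."""
    groups = {}
    for idx, run in enumerate(sorted(set(runs))):
        groups.setdefault(idx // group_size, []).append(run)
    return groups

# 12 runs grouped four at a time -> 3 cross-validation folds.
for fold, test_runs in group_runs(range(1, 13), group_size=4).items():
    print("fold", fold, "tests on runs", test_runs)
# fold 0 tests on runs [1, 2, 3, 4], fold 1 on [5, 6, 7, 8], etc.
```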
Jo
On 8/7/2012 10:52 AM, Edmund Chong wrote:
Hi all,
I recently asked a question on dealing with unbalanced datasets and
here's a follow-up question.
So let's say I have empty runs, or runs where there are zero samples
for one of the conditions. This leads to problems if that run happens
to be the test run in a leave-one-run-out cross-validation procedure.
My workaround was this: if I had one such run with empty conditions,
then I would set NFoldPartitioner(cvtype=2), together with Balancer(),
so that any combination of two runs would have at least one sample per
condition. But if I had two such runs with empty conditions, then I
would set cvtype=3, and so on. However, this means I have less data
for the training set on each classification fold.
Is there any other possible solution for this? In fact, is it possible
to do leave-n-samples-out classification: on each fold I randomly
select n samples per condition to test on, and use the remaining
samples (after balancing) for training, disregarding the chunks
structure?
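For concreteness, here is a stdlib-Python sketch of that leave-n-samples-out scheme; the function name, fold count, and data layout are invented, and this is not a pymvpa partitioner:

```python
# Sketch: leave-n-samples-out folds that ignore the run (chunk)
# structure. Each fold randomly holds out n samples per condition for
# testing and trains on everything else. Illustrative only.
import random

def leave_n_samples_out(condition_of, n, n_folds, seed=0):
    """Return (train_indices, test_indices) pairs with n test samples
    per condition on every fold, drawn without regard to chunks."""
    rng = random.Random(seed)
    by_cond = {}
    for idx, cond in enumerate(condition_of):
        by_cond.setdefault(cond, []).append(idx)
    folds = []
    for _ in range(n_folds):
        test = []
        for idxs in by_cond.values():
            test.extend(rng.sample(idxs, n))
        train = [i for i in range(len(condition_of)) if i not in set(test)]
        folds.append((sorted(train), sorted(test)))
    return folds

# Six samples per condition; each fold tests on 2 'A's and 2 'B's.
conds = ['A'] * 6 + ['B'] * 6
for train, test in leave_n_samples_out(conds, n=2, n_folds=3):
    print("test:", test)
```

Note that because samples are redrawn on every fold, folds can overlap; whether that (and the broken run structure) is acceptable depends on the dependency checks discussed above.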
Thanks!
-Edmund
_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa
--
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/
--
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/