I ran into a similar issue with unbalanced classification and wanted to look at the individual partitions as well. I couldn't figure out how to do that in PyMVPA alone, so I ended up using a separate Python module, UnbalancedDataset: https://github.com/fmfn/UnbalancedDataset. With it, I sub-sampled the more common group to balance the two groups, which gave me a new dataset. I could then investigate what was going on in that dataset, and what each of the partitions looks like, as if it were a regular dataset.
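To make the sub-sampling step concrete, here is a minimal sketch in plain Python. It is not the UnbalancedDataset API and not PyMVPA code; the function name undersample_majority and the 8-vs-12 group sizes are made up for illustration. It keeps every sample of the smaller group and draws an equally sized random subset from the larger one.

```python
import random
from collections import Counter


def undersample_majority(targets, seed=0):
    """Return sorted sample indices that keep all samples of the rarest
    class and an equally sized random subset of every other class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Collect the sample indices belonging to each class label.
    by_class = {}
    for idx, label in enumerate(targets):
        by_class.setdefault(label, []).append(idx)
    # The rarest class determines how many samples each class keeps.
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)


# Made-up labels: 8 subjects in group 0, 12 subjects in group 1.
targets = [0] * 8 + [1] * 12
keep = undersample_majority(targets, seed=42)
balanced_targets = [targets[i] for i in keep]
print(keep)                       # which subjects survived the sub-sampling
print(Counter(balanced_targets))  # class counts after balancing
```

Printing keep shows exactly which subjects of the larger group were dropped. I would expect the same index list to work for slicing a PyMVPA dataset (something like ds[keep]), but I have not checked that against your data, so treat it as an assumption.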
Is that what you're looking for?

On Thu, Nov 19, 2015 at 9:27 AM, Ulrike Kuhl <k...@cbs.mpg.de> wrote:

> Dear Yaroslav, dear all,
>
> I might have solved the balancing problem using the pyMVPA's 'Balancer'
> (duh!).
> I extended the code of the partitioner like this:
>
> npart = ChainNode([
>     NFoldPartitioner(len(DS_noisy.sa['targets'].unique),
>                      attr='chunks'),
>     ## so it should select only those splits where we took 1 from
>     ## each of the targets categories leaving things in balance
>     Sifter([('partitions', 2),
>             ('targets',
>              { 'uvalues': DS_noisy.sa['targets'].unique,
>                'balanced': True})
>            ]),
>     Balancer(attr='targets',count=1,limit='partitions',apply_selection=True)
>     ], space='partitions')
>
>
> The classification result on noisy data looks perfect even on imbalanced
> group sizes - is it correct to do it like this?
>
> Also, I would still like know how I can see how the individual partitions
> look like.
>
> Thanks!
> Ulrike
>
>
> ----- Original Message -----
> From: "kuhl" <k...@cbs.mpg.de>
> To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa@lists.alioth.debian.org>
> Sent: Thursday, 19 November, 2015 10:40:31
> Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel
>
> Dear Yaroslav, dear all,
>
> hooray for simulations! :-)
>
> I was not aware of the profound effect on classification performance if the
> groups are not perfectly balanced.
> My tests using the 'npart' partitioner on clean and noisy test data showed
> the expected result (accuracy of 0.5 for non-signal voxels, 1 for the
> others). Cool!
>
> Still, two questions remain:
> a) Can I assess how the individual partitions look like (i.e. which subject
> is additionally removed to make the groups balanced)?
>
> b) How do I deal with groups that have a larger imbalance? I've tried with my
> dummy data already: If I feed a dataset with imbalanced group sizes into the
> classification with 'npart' partitioner the result is random classification
> at all voxels.
> In my original data I have more participants in the second group than in the
> first, so I would need to restrict the size of the second group given the
> size of the first for each partition. My idea was to take everyone from group
> 1 and randomly pick the same number of participants from group 2 - what's the
> best way to realize this?
>
> Thanks a lot!
> Ulrike
>
> ----- Original Message -----
> From: "Yaroslav Halchenko" <deb...@onerussian.com>
> To: "pkg-exppsy-pymvpa" <pkg-exppsy-pymvpa@lists.alioth.debian.org>
> Sent: Tuesday, 17 November, 2015 16:18:20
> Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel
>
> On Tue, 17 Nov 2015, Ulrike Kuhl wrote:
>
>> Here you go:
>
>> print DS_clean.summary()
>
>> Dataset: 20x3375@float32, <sa: chunks,subject,targets>, <fa:
>> modality,modality_index,voxel_indices>, <a: mapper>
>> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1
>
>> Counts of targets in each chunk:
>>   chunks\targets   0   1
>>                   --- ---
>>         0          1   0
>>         1          1   0
>>         2          1   0
>>         3          1   0
>>         4          1   0
>>         5          1   0
>>         6          1   0
>>         7          1   0
>>         8          1   0
>>         9          1   0
>>        10          0   1
>>        11          0   1
>>        12          0   1
>>        13          0   1
>>        14          0   1
>>        15          0   1
>>        16          0   1
>>        17          0   1
>>        18          0   1
>>        19          0   1
>
>
> so the problem is that in each chunk you have only one sample and
> overall you have only 20 samples to train on. Whenever you NFold
> partition it, you end up with 10 samples of one target and 9 of
> another. If there is a clear signal, error is minimized to correct
> labeling. If there is no signal, error is minimized to just always say
> "class with majority of samples (10 vs 9)" which then always leads to
> misclassification of the held out sample since it is of the opposite
> class.
>
> the fun was if you just ran it on real data -- most probably you would have got
> some strong negative bias but possibly still some reasonable around chance
> performances... and then would have scratched you head a lot. So --
> simulations rule!
> ;)
>
> The simplest way to handle it: guarantee balanced number of samples from
> both categories in training (and thus testing) splits.
>
> There are two ways then to do it:
>
> 1. simplest but more ad-hoc. Group them all with chunks bringing two
> samples from both classes together, so you end up with 10 chunks and
> thus 10 splits if using NFold(1)
>
> 2. create a partitioner which would select all possible combinations
> from the two classes, i.e. have 10*10=100 splits.
>
> Two ways to do it
>
> a. with existing codebase smth like this should work
>
> npart = ChainNode([
>     NFoldPartitioner(len(ds.sa['targets'].unique),
>                      attr='chunks'),
>     ## so it should select only those splits where we took 1 from
>     ## each of the targets categories leaving things in balance
>     Sifter([('partitions', 2),
>             ('targets',
>              { 'uvalues': ds.sa['targets'].unique,
>                'balanced': True})
>            ]),
>     ], space='partitions')
>
> which will do in your case NFold(2) across chunks, thus select every
> combination of two chunks, but then use only those (Sifter removes others)
> which have balanced targets.
>
> b. WiP
> https://github.com/PyMVPA/PyMVPA/pull/386
> to simplify this so it would look like
>
> factpart = FactorialPartitioner(
>     NFoldPartitioner(attr='chunks'),
>     attr='targets'
>     )
>
> N.B. Matteo -- one more testcase to test! ;)
>
> --
> Yaroslav O. Halchenko
> Center for Open Neuroscience http://centerforopenneuroscience.org
> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
> Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
> WWW: http://www.linkedin.com/in/yarik
>
> _______________________________________________
> Pkg-ExpPsy-PyMVPA mailing list
> Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa
> --
> Max Planck Institute for Human Cognitive and Brain Sciences
> Department of Neuropsychology (A219)
> Stephanstraße 1a
> 04103 Leipzig
>
> Phone: +49 (0) 341 9940 2625
> Mail: k...@cbs.mpg.de
> Internet: http://www.cbs.mpg.de/staff/kuhl-12160