Hi Olivier, all,

After playing around some more, I may have a partial solution, but I would appreciate it if you could help me check some assumptions.
First, I realized that my original PCA did not make much sense. What I want to do is reduce the feature dimensions in my classification while keeping the number of observations, so I'm transposing the X array before fitting the PCA with it. Additionally, I'm fitting the whole array but then plotting each label in a different color to visualize structure. Also, as Olivier suggested, I'm fitting a 3-dimensional PCA. Here's what this looks like, in code and plots:

http://web.mit.edu/mwaskom/www/bad_pca.png

Recall that this has no class information; the labels correspond to each of four fMRI runs, but I'm trying to classify conditions that are equally distributed within these runs. Clearly, there is structure in there that shouldn't exist.

One possible explanation is that MR intensity varies over time. In my preprocessing, each run gets globally scaled so that the median value of each run is the same (note that nothing is done to the variance). Even though the distribution of values across features looks the same for each run, I suspect that something is happening here that's upsetting the classification.

As a possible solution, I've called sklearn.preprocessing.scale on each label-wise (i.e., run-wise) chunk of the data and then repeated the PCA procedure. This yields a result much closer to what I would expect:

http://web.mit.edu/mwaskom/www/good_pca.png

(I just caught the duplication of the i variable as I was copying these figures, but it doesn't affect the results.)

So, while this doesn't fully explain my original question (the class domination), it does seem to be a potential solution. One important question, though, is whether it is valid to scale my features within each run. My intuition is that it's fine as long as I'm doing leave-one-run-out cross-validation, as the test set won't have been transformed with any parameters determined from the training set. But I think it's best to double-check with the experts.
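For concreteness, here is roughly what the run-wise scaling step looks like. The array shapes, the `runs` label vector, and the variable names are made up for illustration; my actual arrays and plotting code differ:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# Hypothetical data: 80 samples (4 runs x 20 timepoints) by 500 voxel features.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 500))
runs = np.repeat([0, 1, 2, 3], 20)  # run label for each sample

# Standardize each run's chunk independently: within a run, each feature
# ends up with zero mean and unit variance. Because the scaling parameters
# are computed per run, no parameters cross the train/test boundary in
# leave-one-run-out cross-validation.
X_scaled = np.vstack([scale(X[runs == r]) for r in np.unique(runs)])

# Fit a 3-component PCA on the run-wise scaled data.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)
print(components.shape)  # (80, 3)
```

The three columns of `components` can then be plotted pairwise (0 vs 1, 0 vs 2, 1 vs 2), colored by run, to check whether the run structure has gone away.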
Also, if there's anything you would suggest I do from here to further elucidate the cause of the structure in bad_pca.png, I'd be happy to look into it.

Best,
Michael

On Mon, Jan 30, 2012 at 12:32 AM, Olivier Grisel <[email protected]> wrote:
> 2012/1/29 Michael Waskom <[email protected]>:
> > Aha, this does indeed suggest something strange:
> >
> > http://web.mit.edu/mwaskom/www/pca.png
> >
> > I'm going to dig into this some more, but I don't really have any
> > strong intuitions to guide me here so if anything pops out at you from
> > that do feel free to speak up :)
>
> Can you do a PCA(3) and plot the projection of axis 0 vs axis 1, then
> axis 0 vs axis 2, and axis 1 vs axis 2?
>
> It seems that class "r+" is almost linearly separable using the first
> 2 components while the 3 others are not at all.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
