Re: PCA on a large data matrix of order 150x11000 - its really simple

Radford Neal Fri, 25 Oct 2002 09:03:49 -0700

In article <[EMAIL PROTECTED]>,
Dr. V. Ravi <[EMAIL PROTECTED]> wrote:


> Thank very much all of you who gave invlauable suggestions and links
> to papers. It is pretty common in almost all bioinformatics problems
> that the number of variables swamp the number of samples. Hence, I
> wanted to perform feature selection by selecting the first few PCs
> before going for further studies. My student has used the 'snap shot
> method' of L. Sirovich, contained in MATLAB. I wanted to have a
> theoritical justification of what he had done. However, the other
> mehods suggested by the scholars in this forum do certainly enhance my
> understanding of the problem and the possible solutions.


I think the responses so far may not have made clear that the situation
is really very simple.  NO SPECIAL ALGORITHMS ARE NEEDED.  You just need
to transpose the data matrix, do PCA with whatever algorithm you like,
and then transform the results.

Suppose X is the data matrix, with the columns being variables and the
rows being cases ("samples" in your terminology).  Assume the
variables have been centered by subtracting their mean, so the sum of
the the values in each column of X is zero.  (Some people might want
to scale the variables to have variance one too.)  For your situation,
the number of columns is much greater than the number of rows.

The principle components are simply the eigenvectors of the matrix X'X
(ie, X-transpose times X).  Since X'X is big, however, we'd rather not
deal with it.  So instead we find the eigenvectors of XX'.  Suppose v
is such an eigenvector, with eigenvalue a, so that (XX')v = av.  Then
it's easy to see that (X'X)(X'v) = X'(XX')v = X'av = a(X'v), so X'v
is an eigenvector of X'X, with eigenvalue a.  

So you just need to find the matrix of principle component vectors for
X', rather than X, then multiply by X' to get the principle components
you want.  If you want your principle component vectors to have length
one, you'll need to adjust for that too.

Here's an illustration in matlab:


  Here's the original data matrix:

  >> A
  
  A =
  
       4     2     1     9     3     3
       3     2     1     0     9     1
       0     4     1     3     0     0
       8     1     3     0     1     8

  Here it is centred:
  
  >> X = A - ones(4,1)*mean(A)
  
  X =
  
      0.2500   -0.2500   -0.5000    6.0000   -0.2500         0
     -0.7500   -0.2500   -0.5000   -3.0000    5.7500   -2.0000
     -3.7500    1.7500   -0.5000         0   -3.2500   -3.0000
      4.2500   -1.2500    1.5000   -3.0000   -2.2500    5.0000

  Here is matlab's idea of the principle components:
  
  >> princomp(X)              
  
  ans =
  
      0.5865    0.0498    0.3412    0.2459   -0.2950   -0.6242
     -0.1917    0.0391   -0.2155    0.7486    0.5386   -0.2545
      0.1856    0.0247   -0.0532   -0.5897    0.6724   -0.4028
     -0.3812    0.6628    0.6361   -0.0258    0.0997   -0.0149
     -0.1424   -0.7208    0.6365    0.0574    0.2135    0.0783
      0.6474    0.1910    0.1568    0.1654    0.3396    0.6141
  
  As seen below, these are just (in reverse order) the eigenvectors of
  the 6x6 matrix X'X, except for the three with zero eigenvalues, which 
  are arbitrary:
  
  >> [V1,D1]=eig(X'*X)
  
  V1 =
  
      0.2568   -0.6825   -0.0732    0.3412    0.0498    0.5865
      0.9501    0.0794    0.0792   -0.2155    0.0391   -0.1917
      0.0805    0.2660   -0.9407   -0.0532    0.0247    0.1856
      0.0410    0.0620   -0.0727    0.6361    0.6628   -0.3812
      0.1305    0.1942   -0.0170    0.6365   -0.7208   -0.1424
      0.0784    0.6447    0.3129    0.1568    0.1910    0.6474
  
  
  D1 =
  
     -0.0000         0         0         0         0         0
           0    0.0000         0         0         0         0
           0         0    0.0000         0         0         0
           0         0         0   34.4317         0         0
           0         0         0         0   63.7342         0
           0         0         0         0         0   83.0842

  We can get the same result by finding the eigenvectors of the 4x4 matrix
  XX', as seen below:
  
  >> [V2,D2]=eig(X*X')
  
  V2 =
  
      0.5000   -0.6515    0.5195   -0.2358
      0.5000   -0.2151   -0.8235   -0.1596
      0.5000    0.7105    0.2053   -0.4506
      0.5000    0.1562    0.0988    0.8461
  
  
  D2 =
  
     -0.0000         0         0         0
           0   34.4317         0         0
           0         0   63.7342         0
           0         0         0   83.0842
  
  >> V3=X'*V2                          
  
  V3 =
  
           0   -2.0021    0.3973    5.3464
           0    1.2648    0.3119   -1.7473
           0    0.3124    0.1975    1.6922
           0   -3.7324    5.2913   -3.4743
           0   -3.7347   -5.7547   -1.2982
           0   -0.9202    1.5249    5.9015

  This doesn't look like the same result, but that's just because the
  eigenvectors aren't the same lengths.  We can fix that as follows:
  
  >> V3 ./ (ones(6,1)*sqrt(sum(V3.^2)))
  Warning: Divide by zero.
  
  ans =
  
         NaN   -0.3412    0.0498    0.5865
         NaN    0.2155    0.0391   -0.1917
         NaN    0.0532    0.0247    0.1856
         NaN   -0.6361    0.6628   -0.3812
         NaN   -0.6365   -0.7208   -0.1424
         NaN   -0.1568    0.1910    0.6474

  The first column is meaningless, but that's fine, since there are
  really only three principle components.

  Putting it together, we get the following as the way to find principle 
  components of A when the number of columns in A is much greater than the 
  number of rows:

  >> X = A - ones(size(A,1),1)*mean(A);
  >> [V,D] = eig(X*X');                
  >> T = X'*V;
  >> T ./ (ones(size(T,1),1)*sqrt(sum(T.^2)))
  Warning: Divide by zero.
  
  ans =
  
         NaN   -0.3412    0.0498    0.5865
         NaN    0.2155    0.0391   -0.1917
         NaN    0.0532    0.0247    0.1856
         NaN   -0.6361    0.6628   -0.3812
         NaN   -0.6365   -0.7208   -0.1424
         NaN   -0.1568    0.1910    0.6474

----------------------------------------------------------------------------
Radford M. Neal                                       [EMAIL PROTECTED]
Dept. of Statistics and Dept. of Computer Science [EMAIL PROTECTED]
University of Toronto                     http://www.cs.utoronto.ca/~radford
----------------------------------------------------------------------------
.
.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================

Re: PCA on a large data matrix of order 150x11000 - its really simple

Reply via email to