In article <[EMAIL PROTECTED]>,
Dr. V. Ravi <[EMAIL PROTECTED]> wrote:
> Thank very much all of you who gave invlauable suggestions and links
> to papers. It is pretty common in almost all bioinformatics problems
> that the number of variables swamp the number of samples. Hence, I
> wanted to perform feature selection by selecting the first few PCs
> before going for further studies. My student has used the 'snap shot
> method' of L. Sirovich, contained in MATLAB. I wanted to have a
> theoritical justification of what he had done. However, the other
> mehods suggested by the scholars in this forum do certainly enhance my
> understanding of the problem and the possible solutions.
I think the responses so far may not have made clear that the situation
is really very simple. NO SPECIAL ALGORITHMS ARE NEEDED. You just need
to transpose the data matrix, do PCA with whatever algorithm you like,
and then transform the results.
Suppose X is the data matrix, with the columns being variables and the
rows being cases ("samples" in your terminology). Assume the
variables have been centered by subtracting their mean, so the sum of
the the values in each column of X is zero. (Some people might want
to scale the variables to have variance one too.) For your situation,
the number of columns is much greater than the number of rows.
The principle components are simply the eigenvectors of the matrix X'X
(ie, X-transpose times X). Since X'X is big, however, we'd rather not
deal with it. So instead we find the eigenvectors of XX'. Suppose v
is such an eigenvector, with eigenvalue a, so that (XX')v = av. Then
it's easy to see that (X'X)(X'v) = X'(XX')v = X'av = a(X'v), so X'v
is an eigenvector of X'X, with eigenvalue a.
So you just need to find the matrix of principle component vectors for
X', rather than X, then multiply by X' to get the principle components
you want. If you want your principle component vectors to have length
one, you'll need to adjust for that too.
Here's an illustration in matlab:
Here's the original data matrix:
>> A
A =
4 2 1 9 3 3
3 2 1 0 9 1
0 4 1 3 0 0
8 1 3 0 1 8
Here it is centred:
>> X = A - ones(4,1)*mean(A)
X =
0.2500 -0.2500 -0.5000 6.0000 -0.2500 0
-0.7500 -0.2500 -0.5000 -3.0000 5.7500 -2.0000
-3.7500 1.7500 -0.5000 0 -3.2500 -3.0000
4.2500 -1.2500 1.5000 -3.0000 -2.2500 5.0000
Here is matlab's idea of the principle components:
>> princomp(X)
ans =
0.5865 0.0498 0.3412 0.2459 -0.2950 -0.6242
-0.1917 0.0391 -0.2155 0.7486 0.5386 -0.2545
0.1856 0.0247 -0.0532 -0.5897 0.6724 -0.4028
-0.3812 0.6628 0.6361 -0.0258 0.0997 -0.0149
-0.1424 -0.7208 0.6365 0.0574 0.2135 0.0783
0.6474 0.1910 0.1568 0.1654 0.3396 0.6141
As seen below, these are just (in reverse order) the eigenvectors of
the 6x6 matrix X'X, except for the three with zero eigenvalues, which
are arbitrary:
>> [V1,D1]=eig(X'*X)
V1 =
0.2568 -0.6825 -0.0732 0.3412 0.0498 0.5865
0.9501 0.0794 0.0792 -0.2155 0.0391 -0.1917
0.0805 0.2660 -0.9407 -0.0532 0.0247 0.1856
0.0410 0.0620 -0.0727 0.6361 0.6628 -0.3812
0.1305 0.1942 -0.0170 0.6365 -0.7208 -0.1424
0.0784 0.6447 0.3129 0.1568 0.1910 0.6474
D1 =
-0.0000 0 0 0 0 0
0 0.0000 0 0 0 0
0 0 0.0000 0 0 0
0 0 0 34.4317 0 0
0 0 0 0 63.7342 0
0 0 0 0 0 83.0842
We can get the same result by finding the eigenvectors of the 4x4 matrix
XX', as seen below:
>> [V2,D2]=eig(X*X')
V2 =
0.5000 -0.6515 0.5195 -0.2358
0.5000 -0.2151 -0.8235 -0.1596
0.5000 0.7105 0.2053 -0.4506
0.5000 0.1562 0.0988 0.8461
D2 =
-0.0000 0 0 0
0 34.4317 0 0
0 0 63.7342 0
0 0 0 83.0842
>> V3=X'*V2
V3 =
0 -2.0021 0.3973 5.3464
0 1.2648 0.3119 -1.7473
0 0.3124 0.1975 1.6922
0 -3.7324 5.2913 -3.4743
0 -3.7347 -5.7547 -1.2982
0 -0.9202 1.5249 5.9015
This doesn't look like the same result, but that's just because the
eigenvectors aren't the same lengths. We can fix that as follows:
>> V3 ./ (ones(6,1)*sqrt(sum(V3.^2)))
Warning: Divide by zero.
ans =
NaN -0.3412 0.0498 0.5865
NaN 0.2155 0.0391 -0.1917
NaN 0.0532 0.0247 0.1856
NaN -0.6361 0.6628 -0.3812
NaN -0.6365 -0.7208 -0.1424
NaN -0.1568 0.1910 0.6474
The first column is meaningless, but that's fine, since there are
really only three principle components.
Putting it together, we get the following as the way to find principle
components of A when the number of columns in A is much greater than the
number of rows:
>> X = A - ones(size(A,1),1)*mean(A);
>> [V,D] = eig(X*X');
>> T = X'*V;
>> T ./ (ones(size(T,1),1)*sqrt(sum(T.^2)))
Warning: Divide by zero.
ans =
NaN -0.3412 0.0498 0.5865
NaN 0.2155 0.0391 -0.1917
NaN 0.0532 0.0247 0.1856
NaN -0.6361 0.6628 -0.3812
NaN -0.6365 -0.7208 -0.1424
NaN -0.1568 0.1910 0.6474
----------------------------------------------------------------------------
Radford M. Neal [EMAIL PROTECTED]
Dept. of Statistics and Dept. of Computer Science [EMAIL PROTECTED]
University of Toronto http://www.cs.utoronto.ca/~radford
----------------------------------------------------------------------------
.
.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
. http://jse.stat.ncsu.edu/ .
=================================================================