Hi, so I am trying to read a little deeper into this and was wondering if you could give me some of your insight on it.
So we construct a random projection -- that's clear, and that takes one pass over the original matrix (or over the covariance matrix in the PCA case). But the paper goes on to say that before we project A onto Q, we need to make sure the span of Q captures the principal left singular vectors, which in turn suggests, per Alg. 4.1, another pass over A followed by a Gram-Schmidt procedure. So what's your insight on this:

1) Do we really need two passes over A to get to the projection? If not, why not?

2) How do we orthonormalize Q efficiently? They seem to have mentioned (so far) that they themselves used Gram-Schmidt with double orthogonalization for that purpose. I am not familiar with that particular variant -- can it be done in a non-iterative way? I only know 'standard' Gram-Schmidt, which is strictly iterative, and iterativeness is averse to MR (and even if it weren't, that would probably mean running another MR job here just to orthonormalize Q, unless again we can figure out how to combine that with the projection job).

I guess this may still simplify further -- I haven't finished reading all of it -- but if the answers are later in the paper, could you point me to the section that describes the variant most suitable for MR? (For concreteness, I've put a toy sketch of my current reading of the double orthogonalization, and of Jake's one-pass idea, below the quoted thread.)

Thank you very much.

-Dmitriy

On Mon, Mar 22, 2010 at 8:12 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Actually, maybe what you were thinking (at least, what *I* am thinking) is
> that you can indeed do it in one pass through the *original* data (i.e. you
> can get away with never keeping a handle on the original data itself),
> because on the "one pass" through that data, you spit out MultipleOutputs --
> one SequenceFile of the randomly projected data, which doesn't hit a
> reducer at all, and a second output which is the outer product of those
> vectors with themselves, which hits a summing reducer.
>
> In this sense, while you need to pass over the original data's *size*
> (in terms of number of rows) a second time, if you want to consider
> it data to be played with (instead of just "training" data for use on a
> smaller subset or even a totally different set), you don't need to pass
> over the entire original data *set* ever again.
>
>   -jake
>
> On Mon, Mar 22, 2010 at 6:35 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > You are probably right. I had a wild hare tromp through my thoughts the
> > other day saying that one pass should be possible, but I can't
> > reconstruct the details just now.
> >
> > On Mon, Mar 22, 2010 at 6:00 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> > > I guess if you mean just do a random projection on the original data,
> > > you can certainly do that in one pass, but that's random projection,
> > > not a stochastic decomposition.
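
P.S. For concreteness, here is a toy, single-machine Java sketch of my current reading of the two pieces above: "Gram-Schmidt with double orthogonalization" (which I take to mean simply running a modified Gram-Schmidt sweep over the columns of Y twice, to repair the orthogonality a single sweep loses to rounding), and the one-pass projection Jake describes, which writes out the projected rows and folds their outer products into a running sum in the same pass. All class and method names are made up and this is not Mahout code -- please correct me if I've got the wrong end of it.

import java.util.Random;

/**
 * Toy sketch only: assumes A fits in memory, which it obviously does not in
 * the MR setting -- the point is just the shape of the computation.
 */
public class StochasticProjectionSketch {

  /**
   * One pass over the rows of A: write out the projected row y_i = a_i * Omega
   * and fold its outer product y_i y_i^T into a running k x k sum. This is the
   * single-machine analogue of the mapper with two outputs Jake describes
   * (the projected rows as one output, the summed outer products as the other).
   */
  static double[][] projectAndAccumulate(double[][] a, double[][] omega,
                                         double[][] projectedOut) {
    int k = omega[0].length;
    double[][] outerSum = new double[k][k];
    for (int r = 0; r < a.length; r++) {
      double[] y = new double[k];
      for (int j = 0; j < k; j++)
        for (int c = 0; c < a[r].length; c++)
          y[j] += a[r][c] * omega[c][j];
      projectedOut[r] = y;                       // first output: a row of Y = A * Omega
      for (int p = 0; p < k; p++)                // second output: sum of y y^T (= Y^T Y)
        for (int q = 0; q < k; q++)
          outerSum[p][q] += y[p] * y[q];
    }
    return outerSum;
  }

  /**
   * "Double orthogonalization" as I read it: run modified Gram-Schmidt over
   * the columns of Y, then run it again; the second sweep is cheap because
   * Y is n x k with small k, and it restores orthogonality lost to rounding.
   */
  static void orthonormalizeTwice(double[][] y) {
    modifiedGramSchmidt(y);
    modifiedGramSchmidt(y);
  }

  /** Column-by-column modified Gram-Schmidt, in place; assumes full column rank. */
  static void modifiedGramSchmidt(double[][] y) {
    int n = y.length, k = y[0].length;
    for (int j = 0; j < k; j++) {
      for (int i = 0; i < j; i++) {              // remove components along columns 0..j-1
        double dot = 0;
        for (int r = 0; r < n; r++) dot += y[r][i] * y[r][j];
        for (int r = 0; r < n; r++) y[r][j] -= dot * y[r][i];
      }
      double norm = 0;                           // normalize column j
      for (int r = 0; r < n; r++) norm += y[r][j] * y[r][j];
      norm = Math.sqrt(norm);
      for (int r = 0; r < n; r++) y[r][j] /= norm;
    }
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int n = 100, m = 20, k = 5;
    double[][] a = gaussian(rnd, n, m);          // stand-in for the input matrix A
    double[][] omega = gaussian(rnd, m, k);      // the random projection matrix Omega
    double[][] y = new double[n][];
    projectAndAccumulate(a, omega, y);           // single pass over A
    orthonormalizeTwice(y);                      // y now holds an orthonormal basis Q for range(A * Omega)
  }

  static double[][] gaussian(Random rnd, int rows, int cols) {
    double[][] mat = new double[rows][cols];
    for (double[] row : mat)
      for (int j = 0; j < cols; j++) row[j] = rnd.nextGaussian();
    return mat;
  }
}

If that reading is right, the orthonormalization only ever touches the n x k matrix Y, never A itself, so even if it stays iterative it shouldn't cost another pass over A (though it may still want its own job over Y) -- but that's exactly the kind of confirmation I'm after.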