Hi,

So I am trying to read a little deeper into this and was wondering if you
could share some of your insight on it.

So we construct a random projection; that part is clear, and it requires one
pass over the original matrix (or over the covariance matrix in the PCA case).
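Just so we're talking about the same thing, here is the kind of toy,
single-machine sketch I have in mind for that pass (plain Java, all names are
mine, not anything from Mahout or the paper): form Y = A * Omega row by row,
regenerating Omega's entries from a seed so the random matrix never has to be
materialized or shipped around.

import java.util.Arrays;
import java.util.Random;

/** Toy sketch: the "one pass" Y = A * Omega, touching each row of A once. */
public class OnePassProjectionSketch {

  /** y_i = a_i * Omega; Omega's j-th row is regenerated from (seed, j). */
  static double[] projectRow(double[] row, int k, long seed) {
    double[] y = new double[k];
    for (int j = 0; j < row.length; j++) {
      // deterministic stream per column index j of A
      Random rnd = new Random(seed ^ (31L * (j + 1)));
      for (int p = 0; p < k; p++) {
        y[p] += row[j] * rnd.nextGaussian();   // Omega[j][p]
      }
    }
    return y;
  }

  public static void main(String[] args) {
    double[][] a = { {1, 2, 3}, {4, 5, 6}, {7, 8, 10} };
    int k = 2;        // k = target rank + oversampling
    long seed = 42L;
    double[][] y = new double[a.length][];
    for (int i = 0; i < a.length; i++) {       // the single pass over A
      y[i] = projectRow(a[i], k, seed);
    }
    for (double[] yi : y) {
      System.out.println(Arrays.toString(yi));
    }
  }
}

In MR terms I picture projectRow as the map function, with no reducer needed
for Y itself.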

But... the paper goes on to say that before we project onto Q, we need to make
sure its span captures the dominant left singular vectors. That in turn
suggests, per Algorithm 4.1, another pass over A and then running a
Gram-Schmidt procedure.

So what's your insight on this:

1) Do we really need to do two passes over A to get to the projection? If not,
why not?

2) How do we orthonormalize Q efficiently? They seem to have mentioned (so
far) that they themselves used Gram-Schmidt with double orthogonalization
for that purpose. I am not familiar with that particular variant; can we do
it in a non-iterative way? I am only familiar with 'standard' Gram-Schmidt,
which is strictly iterative. But iterativeness is averse to MR (and even if
it weren't, that would probably mean running another MR job here just to
orthonormalize Q, unless again we can figure out how to fold that into the
projection job).
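For reference, this is what I assume 'double orthogonalization' means: each
column is orthogonalized against the previous ones twice before it is
normalized (toy Java again, my own reading, not the paper's code). Since Q
has only k columns, with k small, this part looks cheap and local even if it
is sequential:

/** Toy sketch: modified Gram-Schmidt with re-orthogonalization ("orthogonalize twice"). */
public class GramSchmidtSketch {

  /** Orthonormalizes the k columns of y (n x k) in place and returns it as Q. */
  static double[][] orthonormalizeColumns(double[][] y) {
    int n = y.length, k = y[0].length;
    for (int j = 0; j < k; j++) {
      for (int pass = 0; pass < 2; pass++) {   // the "double" part
        for (int p = 0; p < j; p++) {
          double dot = 0;
          for (int i = 0; i < n; i++) dot += y[i][p] * y[i][j];
          for (int i = 0; i < n; i++) y[i][j] -= dot * y[i][p];
        }
      }
      double norm = 0;
      for (int i = 0; i < n; i++) norm += y[i][j] * y[i][j];
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) y[i][j] /= norm;
    }
    return y;
  }

  public static void main(String[] args) {
    double[][] y = { {1, 1}, {1, 2}, {1, 3} };
    double[][] q = orthonormalizeColumns(y);
    double dot = 0;
    for (double[] row : q) dot += row[0] * row[1];
    System.out.println("column dot product ~ 0: " + dot);
  }
}

My understanding (could be wrong) is that the second pass is just the usual
"twice is enough" re-orthogonalization trick to keep Q numerically orthogonal
when the columns of Y are nearly dependent; it doesn't change the asymptotics.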

I guess it may still be simplified further; I haven't finished reading all
of it, but if the answers are further in the paper, do you think you could
point me to the section that talks about the variation most suitable for MR?


Thank you very much.

-Dmitriy

On Mon, Mar 22, 2010 at 8:12 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Actually, maybe what you were thinking (at least, what *I* am thinking) is
> that you can indeed do it on one pass through the *original* data (ie you
> can
> get away with never keeping a handle on the original data itself), because
> on the "one pass" through that data, you spit out MultipleOutputs - one
> SequenceFile of the randomly projected data, which doesn't hit a reducer
> at all, and a second output which is the outer product of those vectors
> with themselves, which hits a summing reducer.
>
> In this sense, while you need to pass over the original data's *size*
> (in terms of number of rows) a second time, if you want to consider
> it data to be played with (instead of just "training" data for use on a
> smaller subset or even totally different set), you don't need to pass
> over the entire original data *set* ever again.
>
>  -jake
>
> On Mon, Mar 22, 2010 at 6:35 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > You are probably right.  I had a wild hare tromp through my thoughts the
> > other day saying that one pass should be possible, but I can't
> reconstruct
> > the details just now.
> >
> > On Mon, Mar 22, 2010 at 6:00 PM, Jake Mannix <jake.man...@gmail.com>
> > wrote:
> >
> > > I guess if you mean just do a random projection on the original data,
> you
> > > can certainly do that in one pass, but that's random projection, not a
> > > stochastic decomposition.
> > >
> >
>
