Dmitriy raises some very good points here. The actual implementation will need to address the small size block problem. My guess is that the right way to do this is to touch the input classes to make sure that the minimum split size is configurable.
---------- Forwarded message ---------- From: Dmitriy Lyubimov <dlie...@gmail.com> Date: Wed, Apr 14, 2010 at 10:59 AM Subject: Fwd: need review for stochastic matrix decomposition doc To: Ted Dunning <ted.dunn...@gmail.com> got thrown out for spam again :) ---------- Forwarded message ---------- From: Dmitriy Lyubimov <dlie...@gmail.com> Date: Wed, Apr 14, 2010 at 10:57 AM Subject: Re: need review for stochastic matrix decomposition doc To: mahout-dev@lucene.apache.org upon some practical thought given, i also can see couple of minor difficulties here. 1) One cannot orthogonalize Q-block if there are less dimensions than vectors. Since # of vectors is (k+p), block orthogonalization will fail if mapper is fed less than (k+p) rows of A. The # of items fed to a mapper is pretty much a whim of JobTracker and HDFS replication system. So there needs to be a recourse for that case, and i can't really think of a good one. Similar problem arises for the last block but it is a smaller one. Assuming one can keep previous Q-block, then it just can be re-orthogonalized along with the leftovers. 2) if the goal is to orthogonalize as many dimentions of Q column vectors at a time as possible, then it goes contrary to the goal of computing B blocks. Indeed, you need to compute entire Q block first and then you can produce sum of outer products (B block), but you have to keep all the A rows around all the time in order to be able to produce outer products with Q. A rows are long (much longer than Q rows) so it may be more practical to leave computation of B for another pass over A (esp. if A rows are dense and extremely high-dimentional, perhaps up to several hundred million entries. -dmitriy On Tue, Apr 13, 2010 at 10:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: > I have been testing the in-core version of this without the blocking and > the > results have been very impressive. > > On Tue, Apr 13, 2010 at 10:14 AM, Dmitriy Lyubimov <dlie...@gmail.com > >wrote: > > > I see . to minimize loss of orthogonality. Brilliant. > > > > On Tue, Apr 13, 2010 at 9:41 AM, Ted Dunning <ted.dunn...@gmail.com> > > wrote: > > > > > What you are suggesting is close to what I am doing. > > > > > > There are two differences, > > > > > > 1) the number of rows does depend on k+p, but not directly. The number > > of > > > rows should be as large as practical to make the operations more > > efficient. > > > > > > 2) Oops. You are right. The columns will need to be renormalized by > the > > n > > > (the number of blocks) > > > > > > On Tue, Apr 13, 2010 at 8:42 AM, Dmitriy Lyubimov <dlie...@gmail.com> > > > wrote: > > > > > > > Why can't we say s=(k+p) and accumulate (k+p) x (k+p) A*Omega rows at > > one > > > > time in the mapper and make sure we orthogonalize (k+p) rows of Q in > > one > > > > block (by running a stock QR)? That way orthogonalization of the > final > > Q > > > > will be quaranteed (although columns of Q would be scaled at approx > > > > [m/(k+p)] )? > > > > > > > > -Dmitriy > > > > > > > > On Mon, Apr 12, 2010 at 9:32 AM, Ted Dunning <ted.dunn...@gmail.com> > > > > wrote: > > > > > > > > > I will be attaching Dmitry's pdf. > > > > > > > > > > The basic difference is that I have gone back to a form closer to > the > > > > > original paper to avoid computing an ill-conditioned SVD. > > > > > > > > > > Dmitriy also said: > > > > > > > > > > I guess at this point i don't understand the details of block QR. I > > > guess > > > > > > i'll need to read up on block QR. I guess you could indeed go > into > > a > > > > > little > > > > > > more into details of that qr() call and perhaps give geometry of > > > > > individual > > > > > > Yi, Qi and Ri matrices in step 1 ( i assume they should have s x > > > (k+p), > > > > s > > > > > x > > > > > > ?, ? x (k+p) where s is the block height .) > > > > > > > > > > > > > > > > There isn't anything magic about the block QR and I doubt you will > > find > > > > any > > > > > information on the variant that I am using. The basic idea is just > > > that > > > > I > > > > > am doing QR decompositions on row-wise blocks of A Omega. > > > > > > > > > > On Sun, Apr 11, 2010 at 10:54 PM, Ted Dunning < > ted.dunn...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Can you attach that to the JIRA? Attachments are stripped from > the > > > > > mailing > > > > > > list, I think. > > > > > > > > > > > > On Sun, Apr 11, 2010 at 9:55 PM, Dmitriy Lyubimov < > > dlie...@gmail.com > > > > > >wrote: > > > > > > > > > > > >> For reference, i am attaching summary of our previous discussion > > > > which > > > > > i > > > > > >> took liberty of slightly reworking. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >