Awesome! Let me know how it goes. I would love to kibitz but can't code anything.
One thing that I did notice is that after computing y = AR, a critical step is the SVD of y. This is probably best done by decomposing y'y, since that is only a k x k matrix (first sketch below).

Also, if R is defined by a PRNG with well-defined seeds for each row, then you don't have to store it at all. It is particularly nice if you can compute the n-th random number from a particular seed at high speed. If you can do that, then multiplication by a sparse matrix is very clean. You can probably manage that with most generators by building a moral equivalent of a skip list: just save the generator state every few hundred elements. A generator that can be skipped ahead efficiently would be better, but the checkpoint trick is usable (second sketch below). Note the comments in the articles about how even very poor generators work just fine here.

Now, in a MR setting, A will probably be split into row-wise patches, and y = AR is the row-wise concatenation of the corresponding patches of y. With the PRNG trick, each mapper can have an implicit copy of R. Otherwise R will have to be distributed as a side file to all mappers, and a merge will be necessary.

But if we are after y'y as well, we can compute it by taking the outer product of each row of y with itself and summing the resulting k x k matrices. This works wonderfully because partial sums merge freely, so combiners and such can combine partial map outputs (third sketch below). You probably still have to store y itself, but with a few fancy keys and a multiple output format, you should be good to go.

On Sun, Sep 27, 2009 at 9:03 PM, Jake Mannix <[email protected]> wrote:

> That shouldn't be too hard to Hadoop-ify.  I'll see if I can try it out
> while porting over the decomposer stuff.

--
Ted Dunning, CTO DeepDyve
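A minimal sketch of the y'y route to the SVD, assuming a tall skinny y with full column rank; numpy here just stands in for whatever matrix library the real code would use:

    import numpy as np

    n, k = 10000, 20
    y = np.random.randn(n, k)            # stand-in for y = A R

    g = y.T @ y                          # k x k Gram matrix, one pass over y
    evals, v = np.linalg.eigh(g)         # eigh since g is symmetric
    order = np.argsort(evals)[::-1]      # largest singular value first
    s = np.sqrt(evals[order])            # singular values of y
    v = v[:, order]                      # right singular vectors of y
    u = (y @ v) / s                      # left singular vectors, y V S^-1

    # sanity check against a direct SVD of y
    u2, s2, vt2 = np.linalg.svd(y, full_matrices=False)
    assert np.allclose(s, s2)

The usual caveat is that going through y'y squares the condition number, but for a random projection y that is rarely a problem.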
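And a sketch of the skip-list trick for the implicit R, using Python's stdlib generator purely for illustration (a real Hadoop job would do this in Java). Treating draw number j*k + c of one stream as entry (j, c) of R is my assumption about layout, not anything from the articles:

    import random

    STRIDE = 256   # save the generator state every few hundred draws

    class SkippableStream:
        # One pass to checkpoint the PRNG state every STRIDE draws;
        # after that the n-th draw costs at most STRIDE steps instead
        # of n. Generators that can jump directly (e.g. numpy's PCG64
        # via .advance()) need no checkpoints at all.
        def __init__(self, seed, length):
            rng = random.Random(seed)
            self.checkpoints = []
            for i in range(length):
                if i % STRIDE == 0:
                    self.checkpoints.append(rng.getstate())
                rng.random()

        def nth(self, n):
            rng = random.Random()
            rng.setstate(self.checkpoints[n // STRIDE])
            for _ in range(n % STRIDE):
                rng.random()
            return rng.random()

    # Multiplying a sparse row of A by the implicit R touches only the
    # rows of R that correspond to nonzero entries:
    def sparse_row_times_R(nonzeros, stream, k):
        y_row = [0.0] * k
        for j, a_ij in nonzeros:          # (column index, value) pairs
            for c in range(k):
                y_row[c] += a_ij * stream.nth(j * k + c)
        return y_row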
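Finally, the y'y-by-outer-products piece. The point is just that partial k x k sums add, which is exactly what a combiner needs; again numpy-flavored Python as a stand-in for the mapper/combiner/reducer code:

    import numpy as np

    def partial_gram(rows):
        # mapper/combiner: sum of outer products over one patch of y
        k = rows.shape[1]
        g = np.zeros((k, k))
        for y_i in rows:
            g += np.outer(y_i, y_i)
        return g

    # reducer: partial Gram matrices just add, so combiners can merge
    # map outputs in any order
    patches = [np.random.randn(100, 5) for _ in range(4)]   # patches of y
    yty = sum(partial_gram(p) for p in patches)

    assert np.allclose(yty, np.vstack(patches).T @ np.vstack(patches))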
