On another note, Sean is absolutely correct, Amazon ElasticMR indeed seems to be stuck with 0.20 (or, rather, stuck with a particular hadoop setup without much flexibility here). I guess moving ahead with APIs in Mahout would indeed create problems for whoever is using EMR (I don't).
On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <[email protected]> wrote: > I would think blockwise multiplication (which is, by the way, has a > standard algorithm in Matrix computations by Van Loan and Golub), is pretty > pointless with Mahout since there's no blockwise matrix format presently, > and even if it were, no existing algorithms support it. All prep utils only > produce row-wise format. We could write a routine to "block" it but it would > seem to be an exercise in futility. > > Second remark is that blockwise multiplication is also pointless for > sufficiently sparse matrices. Indeed, sum of outer products of columns and > rows with intermediate reduction in combiners is by far most promising in > terms of shuffle/sort io. Outer products, when further split in columns or > rows, would also be quite sparse and hence small in size while reduction > in keyset cardinality is just gigantic compared to blockwise > multiplications. (That said, i never ran comparison benchmark of the two) > > Note that what authors essentially are suggesting (even in strategy 4) > that there is explosive growth of shuffle and sort keyset i/o, and what's > more, they say they never tried it in distributed mode(!). imagine hundreds > of machines sending a copy of their input to a lot of other machines in the > cluster. Summing outer products avoids broadcasting the input to multiple > reducers. > > On another note, if input is similarly partitioned (not always the case), > then map-side multiplication will always be I/O superior to reduce-side > multiplication since while I/O is less and especially less in the keyset > cardinality undergoing thru sorters. The power of map-side operations comes > from the notion that yes we require a lot from the input but no, it's not a > lot if input is already part of a bigger MR pipeline. > > Finally, back to 0.20/0.21 issue... I said before in this thread that > migrating to 0.21 would render Mahout incompatible with majority of > production frameworks out there. But after working with ssvd code, i came to > think of a compromise: since most of the production environments are running > Cloudera distribution, many 0.21 things are supported there and there's a > lot of code around that's written for new API which is backported in > Cloudera. It's difficult for me to judge how much Cloudera's implementation > covers of what is in 0.21 (in fact, i did come across a couple of 0.21 > things still missing in CDH), but in terms of Hadoop compatibility, i think > Mahout project would be best served if it indeed moved on to a new api (i.e. > 0.21 ) but would not get ahead of what is supported in CDH3. That would keep > it on the edge of what's currently practical and out there. Keeping sitting > on the old api IMO is definitely a drag. My stochastic svd code is using new > api in CDH3 and i would very much not want to backport it to old api, it > would not be practical as everyone out there is on CDH and more so than on > 0.20.2. > > -Dmitriy > > > >> Some more general remarks: I think the matrix multiplication can be >>> implemented more efficiently. I've done a matrix multiplication of a sparse >>> 500kx15k matrix with around 35 million elements on a quite powerful cluster >>> of 10 nodes, and this took around 30 minutes. I have no idea of the >>> performance of the implementation described at >>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't >>> really compare. But Imho this can be improved ( though it's possible that >>> the poor performance was due to mistakes made by me ) >>> >> I will definitely investigate these methods over the coming days, these >> look fantastic. >> >> Shannon >> > >
