If there are specific patches you would like applied to Elastic MapReduce, I would recommend asking for them on our forums:
https://forums.aws.amazon.com/forum.jspa?forumID=52

We are fairly receptive when it comes to customer feedback about patches.

Regards,
Andrew

On Sun, Jan 2, 2011 at 11:52 PM, Dmitriy Lyubimov <[email protected]> wrote:

> On another note, Sean is absolutely correct: Amazon ElasticMR indeed seems
> to be stuck with 0.20 (or, rather, stuck with a particular Hadoop setup
> without much flexibility here). I guess moving ahead with APIs in Mahout
> would indeed create problems for whoever is using EMR (I don't).
>
> On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I would think blockwise multiplication (which, by the way, has a
>> standard algorithm in Matrix Computations by Golub and Van Loan) is pretty
>> pointless with Mahout, since there's no blockwise matrix format presently,
>> and even if there were, no existing algorithms support it. All prep utils
>> only produce row-wise format. We could write a routine to "block" it, but
>> it would seem to be an exercise in futility.
>>
>> Second remark is that blockwise multiplication is also pointless for
>> sufficiently sparse matrices. Indeed, the sum of outer products of columns
>> and rows, with intermediate reduction in combiners, is by far the most
>> promising in terms of shuffle/sort I/O. Outer products, when further split
>> into columns or rows, would also be quite sparse and hence small in size,
>> while the reduction in keyset cardinality is just gigantic compared to
>> blockwise multiplication. (That said, I never ran a comparison benchmark
>> of the two.)
>>
>> Note that what the authors are essentially suggesting (even in strategy 4)
>> implies explosive growth of shuffle and sort keyset I/O, and what's
>> more, they say they never tried it in distributed mode(!). Imagine hundreds
>> of machines sending a copy of their input to a lot of other machines in the
>> cluster. Summing outer products avoids broadcasting the input to multiple
>> reducers.
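The outer-product formulation described in the quoted paragraphs can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout's implementation: the function names (`map_outer_product`, `combine`, `multiply`) and the dict-of-dicts sparse layout are hypothetical, and the shuffle is simulated in a single process. It assumes column k of A and row k of B can be paired by their shared index k, so each input vector is read once and never broadcast.

```python
from collections import defaultdict

def map_outer_product(a_col, b_row):
    """Emit (i, partial_row) pairs for the outer product of
    one sparse column of A and the matching sparse row of B."""
    for i, a_ik in a_col.items():
        yield i, {j: a_ik * b_kj for j, b_kj in b_row.items()}

def combine(pairs):
    """Sum sparse partial rows that share an output row index.
    The same function can serve as both combiner and reducer."""
    acc = defaultdict(lambda: defaultdict(float))
    for i, partial in pairs:
        for j, value in partial.items():
            acc[i][j] += value
    return {i: dict(row) for i, row in acc.items()}

def multiply(a_cols, b_rows):
    """Compute C = A * B as a sum of outer products.
    a_cols maps k -> {i: A[i][k]}; b_rows maps k -> {j: B[k][j]}."""
    emitted = []
    # Only indices present on both sides contribute to the product.
    for k in a_cols.keys() & b_rows.keys():
        emitted.extend(map_outer_product(a_cols[k], b_rows[k]))
    return combine(emitted)

# A = [[1, 2], [0, 3]] stored by column, B = [[4, 0], [5, 6]] stored by row.
a_cols = {0: {0: 1.0}, 1: {0: 2.0, 1: 3.0}}
b_rows = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}
c = multiply(a_cols, b_rows)
```

The shuffle keys here are output row indices only, so the keyset stays small, and zero entries never produce any output at all.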
>>
>> On another note, if the input is similarly partitioned (not always the
>> case), then map-side multiplication will always be I/O superior to
>> reduce-side multiplication, since overall I/O is less, and especially so
>> in the keyset cardinality going through the sorters. The power of map-side
>> operations comes from the notion that yes, we require a lot from the input,
>> but no, it's not a lot if the input is already part of a bigger MR pipeline.
>>
>> Finally, back to the 0.20/0.21 issue... I said before in this thread that
>> migrating to 0.21 would render Mahout incompatible with the majority of
>> production frameworks out there. But after working with the ssvd code, I
>> came to think of a compromise: since most production environments are
>> running the Cloudera distribution, many 0.21 things are supported there,
>> and there's a lot of code around that's written for the new API, which is
>> backported in Cloudera. It's difficult for me to judge how much of what is
>> in 0.21 Cloudera's implementation covers (in fact, I did come across a
>> couple of 0.21 things still missing in CDH), but in terms of Hadoop
>> compatibility, I think the Mahout project would be best served if it indeed
>> moved on to the new API (i.e. 0.21) but did not get ahead of what is
>> supported in CDH3. That would keep it on the edge of what's currently
>> practical and out there. Continuing to sit on the old API is, IMO,
>> definitely a drag. My stochastic SVD code is using the new API in CDH3 and
>> I would very much not want to backport it to the old API; it would not be
>> practical, as far more users out there are on CDH than on 0.20.2.
>>
>> -Dmitriy
>>
>>>> Some more general remarks: I think the matrix multiplication can be
>>>> implemented more efficiently. I've done a matrix multiplication of a
>>>> sparse 500k x 15k matrix with around 35 million elements on a quite
>>>> powerful cluster of 10 nodes, and this took around 30 minutes.
>>>> I have no idea of the
>>>> performance of the implementation described at
>>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>>> really compare. But IMHO this can be improved (though it's possible that
>>>> the poor performance was due to mistakes made by me).
>>>>
>>> I will definitely investigate these methods over the coming days, these
>>> look fantastic.
>>>
>>> Shannon
