Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Andrew Hitchcock Tue, 04 Jan 2011 17:30:31 -0800

If there are specific patches you would like applied to Elastic
MapReduce, I would recommend asking for them on our forums:


https://forums.aws.amazon.com/forum.jspa?forumID=52

We are fairly receptive when it comes to customer feedback about patches.

Regards,
Andrew

On Sun, Jan 2, 2011 at 11:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> On another note, Sean is absolutely correct, Amazon ElasticMR indeed seems
> to be stuck with 0.20 (or, rather, stuck with a particular hadoop setup
> without much flexibility here). I guess moving ahead with APIs in Mahout
> would indeed create problems for whoever is using EMR (I don't).
>
> On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I would think blockwise multiplication (which, by the way, has a
>> standard algorithm in Matrix Computations by Golub and Van Loan) is pretty
>> pointless with Mahout since there's no blockwise matrix format presently,
>> and even if there were, no existing algorithms support it. All prep utils
>> only produce row-wise format. We could write a routine to "block" it but it
>> would seem to be an exercise in futility.
>>
>> My second remark is that blockwise multiplication is also pointless for
>> sufficiently sparse matrices. Indeed, a sum of outer products of columns
>> and rows with intermediate reduction in combiners is by far the most
>> promising in terms of shuffle/sort I/O. Outer products, when further split
>> into columns or rows, would also be quite sparse and hence small in size,
>> while the reduction in keyset cardinality is just gigantic compared to
>> blockwise multiplication. (That said, I never ran a comparison benchmark of
>> the two.)
>>
>> Note that what the authors are essentially suggesting (even in strategy 4)
>> implies explosive growth of shuffle and sort keyset I/O, and what's more,
>> they say they never tried it in distributed mode(!). Imagine hundreds of
>> machines sending a copy of their input to a lot of other machines in the
>> cluster. Summing outer products avoids broadcasting the input to multiple
>> reducers.
>>
>> On another note, if the inputs are similarly partitioned (not always the
>> case), then map-side multiplication will always be superior in I/O to
>> reduce-side multiplication, since overall I/O is lower, and especially so
>> in the cardinality of the keyset going through the sorters. The power of
>> map-side operations comes from the notion that yes, we require a lot from
>> the input, but no, it's not a lot if the input is already part of a bigger
>> MR pipeline.
>>
>> Finally, back to the 0.20/0.21 issue... I said before in this thread that
>> migrating to 0.21 would render Mahout incompatible with the majority of
>> production frameworks out there. But after working with the ssvd code, I
>> came to think of a compromise: since most production environments are
>> running the Cloudera distribution, many 0.21 things are supported there,
>> and there's a lot of code around that's written for the new API, which is
>> backported in Cloudera. It's difficult for me to judge how much of what is
>> in 0.21 Cloudera's implementation covers (in fact, I did come across a
>> couple of 0.21 things still missing in CDH), but in terms of Hadoop
>> compatibility, I think the Mahout project would be best served if it indeed
>> moved on to the new API (i.e. 0.21) but did not get ahead of what is
>> supported in CDH3. That would keep it on the edge of what's currently
>> practical and out there. Continuing to sit on the old API is, IMO,
>> definitely a drag. My stochastic SVD code is using the new API in CDH3 and
>> I would very much not want to backport it to the old API; it would not be
>> practical, as far more people out there are on CDH than on 0.20.2.
>>
>> -Dmitriy
>>
>>
>>
>>>> Some more general remarks: I think the matrix multiplication can be
>>>> implemented more efficiently. I've done a matrix multiplication of a sparse
>>>> 500kx15k matrix with around 35 million elements on a quite powerful cluster
>>>> of 10 nodes, and this took around 30 minutes. I have no idea of the
>>>> performance of the implementation described at
>>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>>> really compare. But IMHO this can be improved (though it's possible that
>>>> the poor performance was due to mistakes made by me).
>>>>
>>> I will definitely investigate these methods over the coming days; these
>>> look fantastic.
>>>
>>> Shannon
>>>
>>
>>
>
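
For readers following the quoted discussion, the sum-of-outer-products
formulation mentioned above can be sketched roughly as follows. This is a
minimal single-process illustration, not Mahout's DistributedRowMatrix code:
the function name `outer_product_multiply` and the dict-of-dicts sparse layout
are assumptions made for this example. Each index k plays the role of one map
input (column k of A paired with row k of B), and the in-place accumulation
stands in for the combiner/reducer summation.

```python
from collections import defaultdict


def outer_product_multiply(a_cols, b_rows):
    """Compute C = A * B as the sum over k of outer(A[:, k], B[k, :]).

    a_cols: {k: {i: value}}  sparse column k of A (hypothetical layout)
    b_rows: {k: {j: value}}  sparse row k of B (hypothetical layout)
    Returns C as {i: {j: value}}, holding only entries that were touched.
    """
    c = defaultdict(lambda: defaultdict(float))
    for k, col in a_cols.items():
        row = b_rows.get(k)
        if not row:
            # No matching row of B: this outer product is all zeros.
            continue
        for i, a_ik in col.items():
            c_i = c[i]
            for j, b_kj in row.items():
                # Partial sums for the same (i, j) are merged here, which is
                # what a combiner would do before the shuffle.
                c_i[j] += a_ik * b_kj
    return {i: dict(r) for i, r in c.items()}


if __name__ == "__main__":
    # A = [[1, 2]], B = [[3], [4]]
    a_cols = {0: {0: 1.0}, 1: {0: 2.0}}
    b_rows = {0: {0: 3.0}, 1: {0: 4.0}}
    print(outer_product_multiply(a_cols, b_rows))
```

Each outer product only produces entries where both the column and the row
are non-zero, so for sparse inputs the partial results stay small, and no
input vector has to be copied to more than one place.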
