Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Sean Owen Thu, 06 Jan 2011 14:02:36 -0800

Yeah... I actually tend to agree, since the old API is pretty useful and is
apparently making a comeback. I personally could go with that. It'd
be great to have more standardization and all that, but that can come
later, as Hadoop more easily permits it.


That said, there are some aspects of the old API I think we can stop
using, and I suppose we should update where possible. I am operating
under the assumption that .mapreduce. is still going to supersede
.mapred. at some point. Is that our opinion?

On Thu, Jan 6, 2011 at 9:36 PM, Jake Mannix <[email protected]> wrote:
> I'm going to dive in and finally add my $0.02 on this whole "0.20 API" issue
> in DistributedRowMatrix:
>
> I very strongly feel that we should *not* constrain ourselves to use the
> new apis in the case of functionality which is *missing* in the new API,
> in particular: map-side joins.  As has been mentioned by Dmitriy and
> others:
>
> On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Second remark is that blockwise multiplication is also pointless for
>> sufficiently sparse matrices. Indeed, a sum of outer products of columns and
>> rows, with intermediate reduction in combiners, is by far the most promising
>> in terms of shuffle/sort I/O.
>
>
> Doing a matrix multiplication in one MR pass is HUGE in comparison to
> having to do reduce-side joins and go through second (or third!) shuffle
> phases.  When you consider doing this K times during Lanczos iteration,
> switching to reduce-side matrix multiplication is a non-starter for me.
>
> In addition, this particular operation (matrix multiplication) is just one
> instance of a fairly general action (LDA would scale better if it also
> did a join of the topic/word parameter matrix and the corpus on each
> iteration, so the entire matrix wasn't loaded into memory on every
> mapper), and doing joins in the reducer means you often have to make
> extra passes, as opposed to joining in the mapper and getting a full
> shuffle-reduce step after to do more work.
>
> So yeah, that's me just saying all this again:
>
>
>> On another note, if input is similarly partitioned (not always the case),
>> then map-side multiplication will always be I/O superior to reduce-side
>> multiplication, since total I/O is lower, and especially so in the
>> cardinality of the keyset going through the sorters. The power of map-side operations comes
>> from the notion that yes we require a lot from the input but no, it's not a
>> lot if input is already part of a bigger MR pipeline.
>>
>
> In general, until feature parity is achieved on the new APIs in a Hadoop
> distribution which is industry standard, I don't think we should constrain
> ourselves to *removing* functionality for the sake of getting rid of
> deprecation warnings.
>
>  -jake
>
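
For readers following the thread, here is a rough sketch of the idea discussed
above: a map-side join over two similarly partitioned, key-sorted inputs, with
the product built as a sum of outer products and partial sums combined locally
before anything is shuffled. It is plain Python, not Hadoop or Mahout code. All
function names are hypothetical, and it assumes both inputs are split
identically and sorted by row key. It computes A^T * B, where each matched pair
of rows (A_k, B_k) contributes the outer product of A_k and B_k.

```python
from collections import defaultdict

# A sparse row is a dict {column_index: value}. A split is a list of
# (row_key, sparse_row) pairs sorted by row_key (assumed, as in the thread,
# to be partitioned the same way for both inputs).

def merge_join(left, right):
    """Map-side join: walk two key-sorted splits in lockstep, yield matches."""
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_row = left[i]
        right_key, right_row = right[j]
        if left_key == right_key:
            yield left_key, left_row, right_row
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1

def map_and_combine(a_split, b_split):
    """One 'mapper': join aligned splits and sum outer products locally,
    playing the role of a combiner so fewer cells reach the shuffle."""
    partial = defaultdict(float)
    for _, a_row, b_row in merge_join(a_split, b_split):
        for i, a in a_row.items():
            for j, b in b_row.items():
                partial[(i, j)] += a * b
    return partial

def reduce_partials(partials):
    """The single 'reduce' step: add up the per-split partial sums."""
    result = defaultdict(float)
    for partial in partials:
        for cell, value in partial.items():
            result[cell] += value
    return dict(result)

def transpose_times(a_splits, b_splits):
    """Compute A^T * B as {(row, col): value} in one map/combine/reduce pass."""
    return reduce_partials(
        map_and_combine(a, b) for a, b in zip(a_splits, b_splits)
    )
```

The point of the sketch is only that the join happens before the shuffle, so
the one shuffle/reduce step is left free to do the summation. A reduce-side
join would need its own shuffle just to bring matching rows together.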
