I'm going to dive in and finally add my $0.02 on this whole "0.20 API" issue
in DistributedRowMatrix:

I very strongly feel that we should *not* constrain ourselves to the
new APIs where functionality is *missing* from them, in particular:
map-side joins.  As has been mentioned by Dmitriy and others:

On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Second remark is that blockwise multiplication is also pointless for
> sufficiently sparse matrices. Indeed, sum of outer products of columns and
> rows with intermediate reduction in combiners is by far most promising in
> terms of shuffle/sort io.


Doing a matrix multiplication in one MR pass is HUGE in comparison to
having to do reduce-side joins and go through a second (or third!)
shuffle phase.  When you consider doing this K times during Lanczos
iteration, switching to reduce-side matrix multiplication is a
non-starter for me.
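
To make the single-pass idea concrete, here is a rough in-memory sketch
(plain Python, not Hadoop code) of the "sum of outer products" approach
Dmitriy describes. It assumes A is stored by column and B by row, both
keyed by the same index k, so the "mapper" can join them directly. The
function names and the dict-of-dicts sparse layout are made up for
illustration and are not taken from DistributedRowMatrix.

```python
from collections import defaultdict


def map_outer_products(a_cols, b_rows):
    """Map side: join column k of A with row k of B, emit partial rows of A*B."""
    # Only keys present in both inputs contribute to the product.
    for k in sorted(a_cols.keys() & b_rows.keys()):
        col, row = a_cols[k], b_rows[k]
        for i, a_ik in col.items():
            # One row of the outer product (column k of A) x (row k of B).
            yield i, {j: a_ik * b_kj for j, b_kj in row.items()}


def sum_partial_rows(pairs):
    """Combiner/reducer: add up the partial rows that share a row index."""
    out = defaultdict(lambda: defaultdict(float))
    for i, partial in pairs:
        for j, value in partial.items():
            out[i][j] += value
    return {i: dict(row) for i, row in out.items()}


# A = [[1, 2], [0, 3]] stored by column, B = [[4, 0], [5, 6]] stored by row.
# Zero entries are simply left out of the sparse dicts.
a_cols = {0: {0: 1.0}, 1: {0: 2.0, 1: 3.0}}
b_rows = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}

product = sum_partial_rows(map_outer_products(a_cols, b_rows))
```

The join happens before anything is shuffled, so the only data crossing
the network is the partial rows, and a combiner can shrink those further
by running the same summation locally.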

In addition, matrix multiplication is just one instance of a fairly
general pattern.  LDA, for example, would scale better if it joined the
topic/word parameter matrix with the corpus on each iteration, so that
the entire matrix didn't have to be loaded into memory on every mapper.
Doing joins in the reducer often means making extra passes, whereas
joining in the mapper leaves a full shuffle-reduce step afterwards to
do more work.
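
For anyone who hasn't looked at what a map-side join boils down to, here
is a minimal sketch of the merge step over two inputs that are already
sorted and partitioned the same way. It is plain Python rather than
anything from the Hadoop join package, it assumes each key appears at
most once per input, and the sample data is invented.

```python
def merge_join(left, right):
    """Yield (key, left_value, right_value) for keys found in both sorted inputs."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)    # left key has no partner, skip it
        elif l[0] > r[0]:
            r = next(right, None)   # right key has no partner, skip it
        else:
            yield l[0], l[1], r[1]
            l, r = next(left, None), next(right, None)


# Two matching splits, each sorted by key (hypothetical data).
corpus_split = [(1, "doc-a"), (3, "doc-b"), (4, "doc-c")]
params_split = [(1, 0.2), (2, 0.5), (4, 0.9)]

joined = list(merge_join(corpus_split, params_split))
```

Each side is read once, sequentially, and nothing needs to be held in
memory beyond the current pair of records, which is why the reducer is
still free to do real work afterwards.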

So yeah, that's just me restating what has already been said:


> On another note, if input is similarly partitioned (not always the case),
> then map-side multiplication will always be I/O superior to reduce-side
> multiplication since while I/O is less and especially less in the keyset
> cardinality undergoing thru sorters. The power of map-side operations comes
> from the notion that yes we require a lot from the input but no, it's not a
> lot if input is already part of a bigger MR pipeline.
>

In general, until the new APIs reach feature parity in a Hadoop
distribution which is industry standard, I don't think we should be
*removing* functionality just for the sake of getting rid of
deprecation warnings.

  -jake
