Re: [GSOC] Matrix Operations on HDFS

Jake Mannix Sun, 30 May 2010 18:48:18 -0700

On Sun, May 30, 2010 at 6:30 PM, Ted Dunning <[email protected]> wrote:

> The Distributed Row Matrix should be ideal for this.  When you run mappers
> against this data structure, each mapper gets a different row.  You can use
> assign to compute your function on each element of a row in the mapper.
>  Define number of reducers = 0 and you are set.
>

Heh, Ted, you are aware that the instance methods I described were
hypothetical, right?  I can certainly add them pretty easily (as I'm sure
Sisir could too, in a patch), but they're "throwing NotYetImplemented"
currently, as one might say.
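
Purely as an illustration of the map-only pattern Ted describes (one row
per mapper call, an element-wise function, zero reducers), here is a minimal
in-memory sketch in plain Java.  RowMatrixSketch and assignEach are
hypothetical names invented for this example, not Mahout's API; a real
version would run as a Hadoop job over HDFS-backed rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

// Hypothetical in-memory stand-in for a row-partitioned matrix.
// This is NOT Mahout's DistributedRowMatrix, only a sketch of the idea.
class RowMatrixSketch {
    private final List<double[]> rows;

    RowMatrixSketch(List<double[]> rows) {
        this.rows = rows;
    }

    // Map-only pass: apply f to every element of every row.
    // Each loop iteration plays the part of one mapper call on one row;
    // there is no reduce step, mirroring "number of reducers = 0".
    RowMatrixSketch assignEach(DoubleUnaryOperator f) {
        List<double[]> out = new ArrayList<>();
        for (double[] row : rows) {
            out.add(Arrays.stream(row).map(f).toArray());
        }
        return new RowMatrixSketch(out);
    }

    double[] row(int i) {
        return rows.get(i);
    }

    public static void main(String[] args) {
        RowMatrixSketch m = new RowMatrixSketch(
            Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 4.0}));
        RowMatrixSketch squared = m.assignEach(x -> x * x);
        System.out.println(Arrays.toString(squared.row(1))); // prints [9.0, 16.0]
    }
}
```

Since rows are transformed independently, nothing here needs a shuffle or
a reduce phase, which is why the job can be map-only.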

> Are you sure that you don't need some kind of reduction function, however?
>

This is the part I'd really like to write up (also NYI), where you give the
DistributedRowMatrix a "mapper" of some kind, which on each row takes
that Vector and produces one or more linear key-value pairs: keys could
be either Null or Integer (possibly Pair<Integer,Integer>?), and values
could be Integer, Double, Vector, or Matrix.  You also pass into that
same method call a "reducer" which does the obvious thing, eventually
spitting out key-value pairs of the same linear types.  If the output ends
up being (Null, Vector) or (Integer, Double), there could be a nice way to
make this function have a Vector return type; if instead the reduce spits
out (Pair<Int,Int>, Double), (Int, Vector), or (Null, Matrix), it could
return a DistributedRowMatrix.
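
To make the shape of that idea concrete, here is a minimal in-memory sketch
of the (Integer, Double) case only, where the reduced output collapses into
a vector.  Everything in it (NumericMapReduceSketch, RowMapper, mapReduce,
the explicit size argument) is a hypothetical name or choice made for
illustration; none of it exists in Mahout, and a real version would run the
map and reduce phases as Hadoop tasks rather than in a loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch of a "numeric-specific" map/reduce over matrix rows.
class NumericMapReduceSketch {

    // The "mapper": sees one row and emits zero or more (Integer, Double) pairs.
    interface RowMapper {
        void map(int rowIndex, double[] row, BiConsumer<Integer, Double> emit);
    }

    // Runs the mapper on each row, folds values that share a key with the
    // reducer, and returns the (Integer, Double) output as a dense vector.
    // Keys are assumed to lie in [0, size).
    static double[] mapReduce(List<double[]> rows, RowMapper mapper,
                              DoubleBinaryOperator reducer, int size) {
        Map<Integer, Double> grouped = new TreeMap<>();
        for (int i = 0; i < rows.size(); i++) {
            mapper.map(i, rows.get(i),
                (k, v) -> grouped.merge(k, v, (a, b) -> reducer.applyAsDouble(a, b)));
        }
        double[] result = new double[size];
        grouped.forEach((k, v) -> result[k] = v);
        return result;
    }

    public static void main(String[] args) {
        List<double[]> rows =
            Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 4.0});
        // Column sums: emit (column index, value), reduce by addition.
        double[] sums = mapReduce(rows, (i, row, emit) -> {
            for (int j = 0; j < row.length; j++) {
                emit.accept(j, row[j]);
            }
        }, Double::sum, 2);
        System.out.println(Arrays.toString(sums)); // prints [4.0, 6.0]
    }
}
```

The other output shapes mentioned above, such as (Int, Vector) or
(Pair<Int,Int>, Double), would need their own return-type handling; this
sketch deliberately leaves them out.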

I've been wanting to add something like this, a kind of "numeric-specific"
but otherwise generic MapReduce API, for a while, but I've been holding
off on account of not wanting to overengineer, if nobody would be using
it.

This is why, Sisir, I'd like to know exactly what kinds of operations you'd
want to do on a big sparse HDFS-backed matrix - simple mutation of
the rows, based on some inputs, or do you need to do some aggregation
across rows and make new kinds of reduced output, or what?  Could
you maybe give a little write-up of how the RBM you're coding up works,
for those of us not "in the thick of it", like Shannon did last week?

  -jake

> You might also look at the k-means clustering which probably is related to
> what you are doing in some sense.
>
> On Sun, May 30, 2010 at 3:24 PM, Sisir Koppaka <[email protected]> wrote:
>
> > I think I need the sort of operation Jake described above -
> > wherein I can call a function f on a vector of the whole matrix (the
> > dataset here, which is sparse) in a distributed fashion. I'll see this in
> > detail tomorrow. But any other pointers on this issue with reference to
> > the MAHOUT-375.diff update are very welcome.
> >
>
