Hi Shannon,

On Fri, Jul 30, 2010 at 8:54 AM, Shannon Quinn <[email protected]> wrote:

> 1) If I'm in a Mapper, and I'm trying to access two matrices of data (the
> rows of one of them form the VectorWritables that are the input to the
> Mapper; the other is a Path argument to the cache), how could I access the
> same row in both matrices simultaneously? My first instinct is to use the
> IntWritable key input and simply access that same row from the saved Path,
> but I'm not sure how the SequenceFile index schemes are set up. For example,
> if I have two DistributedRowMatrices, would the same key reference the same
> row in both?
>

Accessing a separate SequenceFile from within a Mapper is *way inefficient*
(orders of magnitude slower).

You want to do a map-side join.  This is what is done in MatrixMultiplyJob:
your Mapper gets an IntWritable as key, and the value is a Pair of
VectorWritables, one from each matrix.
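
To make the idea concrete, here is a rough plain-Java sketch of what the join
does conceptually. It is not the Mahout or Hadoop code: RowJoinSketch,
JoinedRow and join are hypothetical names, and sorted in-memory maps stand in
for the two row-keyed SequenceFiles. It assumes both matrices are keyed by the
same row index and sorted by it, and only emits rows present in both.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: a merge join over two row-keyed inputs.
class RowJoinSketch {

    /** One joined record: the row index plus the matching row of each matrix. */
    static final class JoinedRow {
        final int row;
        final double[] left;
        final double[] right;

        JoinedRow(int row, double[] left, double[] right) {
            this.row = row;
            this.left = left;
            this.right = right;
        }
    }

    /** Walks both sorted inputs in lockstep and pairs rows with equal keys. */
    static List<JoinedRow> join(SortedMap<Integer, double[]> a,
                                SortedMap<Integer, double[]> b) {
        List<JoinedRow> out = new ArrayList<>();
        Iterator<Map.Entry<Integer, double[]>> ia = a.entrySet().iterator();
        Iterator<Map.Entry<Integer, double[]>> ib = b.entrySet().iterator();
        Map.Entry<Integer, double[]> ea = ia.hasNext() ? ia.next() : null;
        Map.Entry<Integer, double[]> eb = ib.hasNext() ? ib.next() : null;

        while (ea != null && eb != null) {
            int cmp = Integer.compare(ea.getKey(), eb.getKey());
            if (cmp == 0) {
                // Same row index in both inputs: this is what the Mapper would see.
                out.add(new JoinedRow(ea.getKey(), ea.getValue(), eb.getValue()));
                ea = ia.hasNext() ? ia.next() : null;
                eb = ib.hasNext() ? ib.next() : null;
            } else if (cmp < 0) {
                ea = ia.hasNext() ? ia.next() : null; // row missing from b, skip it
            } else {
                eb = ib.hasNext() ? ib.next() : null; // row missing from a, skip it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, double[]> a = new TreeMap<>();
        SortedMap<Integer, double[]> b = new TreeMap<>();
        a.put(0, new double[] {1, 2});
        a.put(2, new double[] {5, 6});
        b.put(0, new double[] {7, 8});
        b.put(1, new double[] {9, 10});
        b.put(2, new double[] {11, 12});
        for (JoinedRow r : join(a, b)) {
            System.out.println("row " + r.row + ": " + r.left.length
                + " entries from A, " + r.right.length + " entries from B");
        }
    }
}
```

The point is that each input is read once, sequentially, instead of doing a
lookup into the second file for every row of the first.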


> 2) I looked through the Mahout math package and nothing stood out: is there
> an easy way for computing the median value of a Vector?


Do you want the median of the non-zero entries (of a sparse vector), or the
true median?  Either way, there's no canned impl of this on the
Vector classes.  It would probably be pretty nice to have an efficient
(linear-time) implementation of this, however.
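
As a starting point, here is a minimal sketch of both variants using
quickselect on a plain double[]. It does not use the Mahout Vector API:
MedianSketch and its methods are hypothetical names, and you would have to
copy the vector's entries into an array first. Quickselect is linear time in
expectation, not in the worst case, and the sketch assumes the data contains
no NaN values. It partitions three ways so that long runs of equal values
(for example the zeros of a sparse vector) do not slow it down.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only: expected linear-time median via quickselect.
class MedianSketch {
    private static final Random RNG = new Random(42);

    /** Returns the k-th smallest element (0-based), reordering a in place. */
    static double select(double[] a, int k) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            double pivot = a[lo + RNG.nextInt(hi - lo + 1)];
            int lt = lo, i = lo, gt = hi;
            // Three-way partition: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
            while (i <= gt) {
                if (a[i] < pivot) {
                    swap(a, lt++, i++);
                } else if (a[i] > pivot) {
                    swap(a, i, gt--);
                } else {
                    i++;
                }
            }
            if (k < lt) {
                hi = lt - 1;
            } else if (k > gt) {
                lo = gt + 1;
            } else {
                return pivot;
            }
        }
        return a[lo];
    }

    /** True median over all entries, zeros included; the input is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of an empty array is undefined");
        }
        double[] a = values.clone();
        int n = a.length;
        double upper = select(a, n / 2);
        if (n % 2 == 1) {
            return upper;
        }
        // After select, a[0 .. n/2 - 1] holds the n/2 smallest values,
        // so the lower middle element is their maximum.
        double lower = a[0];
        for (int i = 1; i < n / 2; i++) {
            lower = Math.max(lower, a[i]);
        }
        return (lower + upper) / 2.0;
    }

    /** Median of the non-zero entries only; throws if there are none. */
    static double medianOfNonZero(double[] values) {
        return median(Arrays.stream(values).filter(v -> v != 0.0).toArray());
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

For an even number of entries this averages the two middle values; if you
want a different convention, change the last line of median.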

  -jake
