Hi Shannon,

On Fri, Jul 30, 2010 at 8:54 AM, Shannon Quinn <[email protected]> wrote:
> 1) If I'm in a Mapper, and I'm trying to access two matrices of data (the
> rows of one of them form the VectorWritables that are the input to the
> Mapper; the other is a Path argument to the cache), how could I access the
> same row in both matrices simultaneously? My first instinct is to use the
> IntWritable key input and simply access that same row from the saved Path,
> but I'm not sure how the SequenceFile index schemes are set up. For
> example, if I have two DistributedRowMatrices, would the same key
> reference the same row in both?

Accessing a separate SequenceFile from within a Mapper is *way inefficient* (orders of magnitude slower). You want to do a map-side join. This is what is done in MatrixMultiplyJob - your Mapper gets IntWritable as key, and the value is a Pair of VectorWritables - one from each matrix.

> 2) I looked through the Mahout math package and nothing stood out: is there
> an easy way for computing the median value of a Vector?

Do you want the median of the non-zero entries (of a sparse vector), or the true median? Either way, there's no canned impl of this on the Vector classes. It would probably be pretty nice to have an efficient (linear-time) implementation of this, however.

-jake
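
For reference, here is a minimal sketch of what an expected-linear-time median could look like, using quickselect with a random pivot and a three-way partition (so runs of equal values, such as the zeros of a sparse vector, do not slow it down). This is not Mahout code: the class MedianSketch and its methods are hypothetical names, it works on a plain double[] rather than on a Vector, and it assumes the input contains no NaN values. It covers both readings of the question: the true median and the median of the non-zero entries only.

```java
import java.util.Random;

/** Hypothetical helper, not part of Mahout: quickselect-based median on a plain array. */
final class MedianSketch {
    private static final Random RNG = new Random(42);

    /** True median over all entries, zeros included. Averages the two middle values for even length. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of an empty array is undefined");
        }
        double[] a = values.clone();          // leave the caller's data untouched
        int n = a.length;
        double upper = select(a, n / 2);      // element of rank n/2 (0-based)
        if (n % 2 == 1) {
            return upper;
        }
        // After select, every element in a[0 .. n/2 - 1] is <= a[n/2],
        // so the lower middle value is the maximum of that prefix.
        double lower = a[0];
        for (int i = 1; i < n / 2; i++) {
            lower = Math.max(lower, a[i]);
        }
        return (lower + upper) / 2.0;
    }

    /** Median over the non-zero entries only; throws if there are none. */
    static double medianOfNonZeros(double[] values) {
        int count = 0;
        for (double v : values) {
            if (v != 0.0) count++;
        }
        double[] nonZeros = new double[count];
        int j = 0;
        for (double v : values) {
            if (v != 0.0) nonZeros[j++] = v;
        }
        return median(nonZeros);
    }

    /** Returns the k-th smallest element (0-based), partially reordering a in place. */
    private static double select(double[] a, int k) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo < hi) {
            double pivot = a[lo + RNG.nextInt(hi - lo + 1)];
            int lt = lo, i = lo, gt = hi;
            while (i <= gt) {                 // three-way (Dutch national flag) partition
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            // Now a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot.
            if (k < lt) hi = lt - 1;
            else if (k > gt) lo = gt + 1;
            else return pivot;
        }
        return a[lo];
    }

    private static void swap(double[] a, int i, int j) {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

For a vector with entries {0, 0, 0, 5, 7}, median(...) considers all five entries while medianOfNonZeros(...) considers only 5 and 7, which is exactly the distinction raised in the reply above. Copying the entries of a Mahout Vector into a double[] first is left out here, since the cheapest way to do that depends on whether the vector is dense or sparse.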
