Hi all,

I have a few questions on the specifics of map/reduce:

1) I've made the assumption so far that the input to my clustering algorithm will be a single CSV file containing the entire affinity matrix, where each line in the file is a row in the matrix. Is there another input approach that would work better for reading this affinity matrix?
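
(For concreteness, a 3x3 affinity matrix in this layout would just be three lines, e.g.:

0.0,0.8,0.1
0.8,0.0,0.5
0.1,0.5,0.0

with the numbers obviously being made up here.)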

2) I've committed a patch for what the M/R task of creating a DistributedRowMatrix out of the input data might look like, but it's unfinished. There isn't a straightforward way of determining which row in the CSV file is currently being processed (since the keys handed to the mapper are byte offsets into the file, rather than line numbers), and it's crucial that lines in the CSV file correspond to rows in the DistributedRowMatrix. I've found a few ways to handle this, but they're either too hacky (adding an index column to the CSV file) or fairly involved (subclassing RecordReader; a rough sketch of that route follows), so I thought I'd ask: does anyone else have thoughts on this?
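
For reference, here's roughly what I had in mind for the RecordReader route (class names are just placeholders of mine, and note the catch: splits have to be disabled so the counter is a true per-file row index, which means one mapper per file):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat variant whose keys are line numbers instead of
// byte offsets. The file must not be split; otherwise each mapper's
// counter would restart at zero partway through the file.
public class LineNumberInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one split per file, so the counter is a true row index
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new LineNumberRecordReader();
  }

  // Delegates to LineRecordReader, replacing the byte-offset key
  // with a running 0-based line count.
  public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private long lineNumber = -1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (!delegate.nextKeyValue()) {
        return false;
      }
      key.set(++lineNumber); // key is now the line (= matrix row) number
      return true;
    }

    @Override
    public LongWritable getCurrentKey() {
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}

The single-split restriction is exactly why I'm not sure this is the right answer for large matrices.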

3) Once I can track which rows are which, how do I make sure the SequenceFiles are written in such a way that the resulting DistributedRowMatrix accurately reflects the arrangement of data in the original CSV file? I've been using TransposeJob as a model for this, but it has the advantage that the keys in its Map step already correspond to row indices. The syntheticcontrol InputMapper has also been useful, but in that case the clustering algorithms don't need to keep the rows in any particular order. (A sketch of what I'm picturing follows.)
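
To make (3) concrete, this is roughly the mapper I'm picturing (AffinityRowMapper is a placeholder name; it assumes the keys coming in are already line numbers, e.g. from the input format sketched above, and that the job writes SequenceFiles with IntWritable keys and VectorWritable values, which is the layout DistributedRowMatrix reads):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Parses one CSV line into a (row index, vector) pair for a
// SequenceFile<IntWritable, VectorWritable>, the layout that
// DistributedRowMatrix reads. Assumes the incoming key is already
// a line number rather than a byte offset.
public class AffinityRowMapper
    extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {

  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(",");
    double[] values = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      values[i] = Double.parseDouble(tokens[i]);
    }
    // The key carries the row identity, so the physical order in
    // which records land in the SequenceFile shouldn't matter.
    context.write(new IntWritable((int) lineNumber.get()),
        new VectorWritable(new DenseVector(values)));
  }
}

If I'm reading TransposeJob correctly, the IntWritable key is what gets treated as the row index downstream, so as long as the keys are right the write order should be irrelevant. Is that a safe assumption?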

Thanks again for all the assistance :)

Regards,
Shannon
