Hi all,

I have a few questions on the specifics of map/reduce:

1) I've made the assumption so far that the input to my clustering algorithm will be a single CSV file containing the entire affinity matrix, where each line in the file is a row in the matrix. Is there another input approach that would work better for reading this affinity matrix?
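
(For concreteness, a 3x3 affinity matrix in this layout would just be three lines, e.g.:

0.0,0.8,0.1
0.8,0.0,0.5
0.1,0.5,0.0

with the numbers obviously being made up here.)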

2) I've committed a patch for what the M/R task of creating a DistributedRowMatrix out of the input data might look like, but it's unfinished. There isn't a straightforward way of determining which row in the CSV file is currently being processed (since the keys handed to the mapper are byte offsets into the file, rather than line numbers), and it's crucial that lines in the CSV file correspond to rows in the DistributedRowMatrix. I've found a few ways to handle this, but they're either too hacky (adding an index column to the CSV file) or fairly involved (subclassing RecordReader; a rough sketch of that route follows), so I thought I'd ask: does anyone else have thoughts on this?
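
For reference, here's roughly what I had in mind for the RecordReader route (class names are just placeholders of mine, and note the catch: splits have to be disabled so the counter is a true per-file row index, which means one mapper per file):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat variant whose keys are line numbers instead of
// byte offsets. The file must not be split; otherwise each mapper's
// counter would restart at zero partway through the file.
public class LineNumberInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one split per file, so the counter is a true row index
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new LineNumberRecordReader();
  }

  // Delegates to LineRecordReader, replacing the byte-offset key
  // with a running 0-based line count.
  public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private long lineNumber = -1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (!delegate.nextKeyValue()) {
        return false;
      }
      key.set(++lineNumber); // key is now the line (= matrix row) number
      return true;
    }

    @Override
    public LongWritable getCurrentKey() {
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}

The single-split restriction is exactly why I'm not sure this is the right answer for large matrices.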

3) Once I can track which rows are which, how do I make sure the SequenceFiles are written in such a way that the resulting DistributedRowMatrix accurately reflects the arrangement of data in the original CSV file? I've been using TransposeJob as a model for this, but it has the advantage that the keys in its Map step already correspond to row indices. The syntheticcontrol InputMapper has also been useful, but in that case the clustering algorithms don't need to keep the rows in any particular order. (A sketch of what I'm picturing follows.)
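
To make (3) concrete, this is roughly the mapper I'm picturing (AffinityRowMapper is a placeholder name; it assumes the keys coming in are already line numbers, e.g. from the input format sketched above, and that the job writes SequenceFiles with IntWritable keys and VectorWritable values, which is the layout DistributedRowMatrix reads):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Parses one CSV line into a (row index, vector) pair for a
// SequenceFile<IntWritable, VectorWritable>, the layout that
// DistributedRowMatrix reads. Assumes the incoming key is already
// a line number rather than a byte offset.
public class AffinityRowMapper
    extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {

  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(",");
    double[] values = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      values[i] = Double.parseDouble(tokens[i]);
    }
    // The key carries the row identity, so the physical order in
    // which records land in the SequenceFile shouldn't matter.
    context.write(new IntWritable((int) lineNumber.get()),
        new VectorWritable(new DenseVector(values)));
  }
}

If I'm reading TransposeJob correctly, the IntWritable key is what gets treated as the row index downstream, so as long as the keys are right the write order should be irrelevant. Is that a safe assumption?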

Thanks again for all the assistance :)

Regards,
Shannon
