Shannon,

Nice work so far.

I think it is more customary to enter a graph by giving the integer pairs that represent the starting and ending nodes of each arc. That avoids the memory allocation problem you hit if one node is connected to millions of others, since no single input line has to hold an entire row. It may also solve your problem with the distributed row matrix, since you could write a reducer to gather everything to the right place for writing a row. In doing that, you would inherently have the row number available, because it would be the grouping key.

If you keep the current format of one matrix row per CSV line, I would recommend putting the source node at the beginning of the line.

On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn wrote:

> 1) I've made the assumption so far that the input to my clustering
> algorithm will be a single CSV file containing the entire affinity matrix,
> where each line in the file is a row in the matrix. Is there another input
> approach that would work better for reading this affinity matrix?
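
P.S. Here is a rough sketch of the edge-list idea, in case it helps. It is plain Python, not Hadoop or Mahout code, and it only imitates what the shuffle and reducer would do. The `source,target,weight` line format, the weight column itself, and the names `parse_edge` and `group_rows` are my own assumptions for illustration.

```python
from collections import defaultdict


def parse_edge(line):
    """Parse one 'source,target,weight' line (hypothetical input format)."""
    source, target, weight = line.strip().split(",")
    # The source node becomes the grouping key, i.e. the matrix row number.
    return int(source), (int(target), float(weight))


def group_rows(lines):
    """Imitate shuffle + reduce: gather every edge under its source node."""
    rows = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, (target, weight) = parse_edge(line)
        rows[source][target] = weight  # sparse row: column index -> value
    return dict(rows)


# One arc per line, so no single line grows with the degree of a node.
edges = ["0,1,0.5", "0,2,0.25", "1,0,0.5", "2,0,0.25"]
rows = group_rows(edges)
for row_number in sorted(rows):
    print(row_number, sorted(rows[row_number].items()))
```

Each key of `rows` plays the role of the reducer's grouping key, so the row number comes for free when the row is written out.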
