Shannon,

Nice work so far.

I think it is more customary to enter a graph as a list of integer pairs,
one pair per arc, giving the starting and ending nodes.  That avoids the
memory allocation problem you hit if one node is connected to millions of
others.  It may also solve your problem with the distributed row matrix,
since you could write a reducer that groups the arcs by starting node and
gathers everything for one row into the right place for writing.  The row
number would then be inherently available because it is the grouping key.

If you keep the current format of one matrix row per CSV line, I would
recommend putting the source node at the beginning of each line.
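
For example, with the source node as the first field, a line can be parsed
without relying on its position in the file.  This is only a sketch; the
function name parse_row_line and the exact field layout are assumptions.

```python
def parse_row_line(line):
    """Parse 'row_number,v0,v1,...' into (row_number, list of values)."""
    fields = line.strip().split(",")
    row_number = int(fields[0])  # source node comes first
    values = [float(v) for v in fields[1:]]  # remaining fields are the row
    return row_number, values


# Hypothetical line for row 3 of a 3-column affinity matrix.
print(parse_row_line("3,0.0,0.5,1.0"))
```

That way the lines can be split across mappers in any order and each one
still carries its own row number.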


On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn <[email protected]> wrote:

>
> 1) I've made the assumption so far that the input to my clustering
> algorithm will be a single CSV file containing the entire affinity matrix,
> where each line in the file is a row in the matrix. Is there another input
> approach that would work better for reading this affinity matrix?
>
>
