Ted, thanks for your reply. This is still in an early phase of research, so I wouldn't like to spend much time on the infrastructure (I need to sleep too ;). The simplest possible solution that works is OK for now. I'll wait for Ed's implementation.
Your mail actually made me rethink my perception of the map-reduce model and the Hadoop implementation. I was assuming that most of the time Hadoop should protect me from worrying about data access time, bandwidth, etc., even if that means the computation will be, let's say, several times slower than an optimal implementation. I assume you're probably talking about the optimal one, or at least a good one, and I agree with you.

Of course, Hadoop can't hide this completely; I'd still have to follow some guidelines (use a sensible number of mappers/reducers, use combiners, make splits large enough that a mapper runs for a couple of minutes, and so on). Hadoop should try to cut down the bandwidth (by spawning a mapper close to the data, etc.). Ordinary matrix multiplication makes this difficult because each element of one matrix has to be multiplied against a whole row (or column) of elements from the other matrix. Unfortunately, not all problems are splittable the way word counting is, where no data movement between nodes is required. This is probably the single-machine developer inside me complaining :) I'll have to consider better ways to partition my problem(s) eventually...

Again, thanks for your mail. I have a few more words for you privately.

--
regards,
Milan
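P.S. To make the fan-out I mean concrete, here's a toy single-machine sketch of the naive map-reduce matrix multiply (plain Python, made-up names, no Hadoop involved): every element of A is re-emitted once per column of B, and every element of B once per row of A, which is exactly the data movement between nodes I was complaining about.

```python
from collections import defaultdict

# Multiply A (m x n) by B (n x p) in map-reduce style.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
m, n, p = 2, 2, 2

# Map phase: key each emitted value by the output cell (i, j) it
# contributes to. Note each input element is emitted many times.
mapped = []
for i in range(m):
    for k in range(n):
        for j in range(p):
            mapped.append(((i, j), ('A', k, A[i][k])))
for k in range(n):
    for j in range(p):
        for i in range(m):
            mapped.append(((i, j), ('B', k, B[k][j])))

# Shuffle: group values by output cell -- this is where the
# bandwidth goes in a real cluster.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: pair up A and B values sharing the inner index k
# and sum the products.
C = [[0] * p for _ in range(m)]
for (i, j), values in groups.items():
    a = {k: v for tag, k, v in values if tag == 'A'}
    b = {k: v for tag, k, v in values if tag == 'B'}
    C[i][j] = sum(a[k] * b[k] for k in a)

print(C)  # [[19, 22], [43, 50]]
```

Each of the m*n elements of A travels to p reducers (and symmetrically for B), so the shuffle volume grows with the output size rather than the input size -- which is why a smarter partitioning (block-wise, for instance) matters here.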