For computing PageRank, however, I bet that memcached would actually slow you down by forcing you to have a smaller cluster.
For a batch program, latency is not the issue; aggregate throughput is. If you have a 50-node MR cluster, you should be able to very easily sustain a few GB/s reading your data with typical 90% task/data locality. For the network sizes that you are talking about, you should be able to do PageRank iterations in pretty short time spans (I would guess about a few minutes for 100 million nodes, but that is without even the back of an envelope to write on). Spread over every link in the graph, that is equivalent to database query times of less than a microsecond if you were instead probing a database for connectivity information between nodes. And you get that with a shared-nothing architecture.

On Fri, Jul 3, 2009 at 7:58 AM, Marcus Herou <[email protected]> wrote:

> > memcached? why not store the intermediate values in the MR FS?
>
> Why do you think I chose memcached ? It was not due to the nice API... Doh!
> Performance of course. It beats the hell out of HDFS for small temporarily
> stored data (that's why it's called a cache and not FS).
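
P.S. The throughput estimate at the top of this message can be put on an envelope after all. This is only a rough sketch: the cluster size and graph size come from the message, but the 3 GB/s aggregate read rate, the 3 minute iteration time, and the average of 10 outgoing links per node are assumed values picked for illustration, and the function names are made up.

```python
# Back-of-envelope check of the batch-throughput argument.
# Values marked "assumed" are illustrative guesses, not measurements.

CLUSTER_NODES = 50            # from the message
AGGREGATE_READ_BPS = 3e9      # assumed: "a few GB/s" taken as 3 GB/s
GRAPH_NODES = 100_000_000     # from the message
AVG_OUT_DEGREE = 10           # assumed: average outgoing links per graph node
ITERATION_SECONDS = 180       # assumed: "a few minutes" taken as 3 minutes


def per_node_read_rate(aggregate_bps, cluster_nodes):
    """Read rate each machine must sustain to reach the aggregate rate."""
    return aggregate_bps / cluster_nodes


def equivalent_lookup_seconds(iteration_seconds, graph_nodes, avg_out_degree):
    """Iteration time divided by the number of links touched per iteration.

    This is the per-lookup latency a database would need in order to
    match the batch job when probed once per link.
    """
    edges = graph_nodes * avg_out_degree
    return iteration_seconds / edges


if __name__ == "__main__":
    rate = per_node_read_rate(AGGREGATE_READ_BPS, CLUSTER_NODES)
    lookup = equivalent_lookup_seconds(
        ITERATION_SECONDS, GRAPH_NODES, AVG_OUT_DEGREE
    )
    print(f"per-node read rate: {rate / 1e6:.0f} MB/s")
    print(f"equivalent time per link lookup: {lookup * 1e6:.2f} microseconds")
```

Under these assumptions each machine only has to read about 60 MB/s, and the per-link figure comes out well under a microsecond. With a much sparser graph (say one link per node) the same arithmetic gives closer to two microseconds, so the sub-microsecond figure depends on the link count you plug in.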
