For computing PageRank, however, I bet that memcached would actually slow you
down by forcing you to have a smaller cluster.

For a batch program, latency is not the issue; aggregate throughput is.  If
you have a 50-node MR cluster, you should very easily be able to sustain a
few GB/s reading your data with typical 90% task/data locality.  For the
network sizes that you are talking about, you should be able to do PageRank
iterations in pretty short time spans (I would guess a few minutes for
100 million nodes, but that is without even the back of an envelope to write
on).  That is equivalent to database query times of less than a microsecond
each if you were probing such a database for connectivity information between
nodes.  And that is with a shared-nothing architecture.
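
To put rough numbers on that guess, here is a sketch of the missing envelope
in Python.  Every input is an assumption chosen for illustration, not a
measurement: the average out-degree, the bytes per edge record, the number of
passes over the data per iteration (read, shuffle, write), and 2 GB/s as a
stand-in for "a few GB/s".  The function name is made up for this example.

```python
# Back-of-envelope estimate. All inputs are assumptions, not measurements.

def iteration_estimate(nodes, edges_per_node, bytes_per_edge,
                       passes_over_data, cluster_bytes_per_sec):
    """Return (seconds per iteration, amortized seconds per edge)."""
    edges = nodes * edges_per_node
    # Total bytes the cluster moves in one iteration.
    bytes_moved = edges * bytes_per_edge * passes_over_data
    seconds = bytes_moved / cluster_bytes_per_sec
    # Amortized cost of touching one edge, comparable to one DB probe.
    return seconds, seconds / edges

seconds, per_edge = iteration_estimate(
    nodes=100_000_000,          # graph size mentioned above
    edges_per_node=10,          # assumed average out-degree
    bytes_per_edge=100,         # assumed record size including overhead
    passes_over_data=3,         # assumed: read, shuffle, write
    cluster_bytes_per_sec=2e9,  # assumed value for "a few GB/s"
)
print(f"{seconds / 60:.1f} min per iteration, "
      f"{per_edge * 1e6:.2f} microseconds per edge")
```

With these particular assumptions the estimate comes out at a few minutes per
iteration and well under a microsecond per edge.  Change the out-degree or
record size and the answer moves proportionally, so treat it as an order of
magnitude only.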

On Fri, Jul 3, 2009 at 7:58 AM, Marcus Herou <[email protected]> wrote:

> > memcached? why not store the intermediate values in the MR FS?
>
> Why do you think I chose memcached ? It was not due to the nice API... Doh!
> Performance of course. It beats the hell out of HDFS for small temporarily
> stored data (that's why it's called a cache and not FS).
