Thanks for the heads-up, Owen. Do you know how long it took to run the
application? And how many files were processed? I am particularly eager
to know the answer to the second question.

I found an article at http://developer.yahoo.net/blogs/hadoop/2008/09/,
where the total number of cores used was over 30,000 and the number of
files in the benchmark run was 14,000. The reported average read
throughput was 18 MB/s on 500 nodes and 66 MB/s on 4000 nodes. The
article explains (underneath Table 1) that:

        "The 4000-node cluster throughput was 7 times better than 500's
for writes and 3.6 times better for reads even though the bigger cluster
carried more (4 v/s 2 tasks) per node load than the smaller one."

Is 66 MB/s the aggregate read throughput or the per-node throughput? If
it were the latter, the aggregate bandwidth would have been 4000 x
66 MB/s = 264 GB/s, and the speedup on 4000 nodes over 500 nodes should
have been (66/18) * (4000/500) ≈ 29.3.
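
For concreteness, here is that arithmetic as a small Python sketch;
treating 66 MB/s as the per-node throughput is only an assumption made
for the sake of the calculation:

    # Sanity check of the per-node reading of the 66 MB/s figure.
    # Node counts and throughputs are the ones quoted from the article.
    nodes_small, nodes_large = 500, 4000
    read_small, read_large = 18, 66          # MB/s, assumed per node

    # Aggregate bandwidth if 66 MB/s is per-node throughput:
    aggregate_gb_s = nodes_large * read_large / 1000.0        # ~264 GB/s

    # Corresponding aggregate speedup of 4000 nodes over 500 nodes:
    speedup = (read_large / read_small) * (nodes_large / nodes_small)  # ~29.3

    print(f"aggregate bandwidth: {aggregate_gb_s:.0f} GB/s")
    print(f"aggregate speedup:   {speedup:.1f}")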


-----Original Message-----
From: Owen O'Malley [mailto:[email protected]] 
Sent: Friday, April 03, 2009 5:20 PM
To: [email protected]
Subject: Re: Hadoop/HDFS for scientific simulation output data analysis

On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote:

> By the way, what is the largest size---in terms of total bytes, number
> of files, and number of nodes---in your applications? Thanks.

The largest Hadoop application that has been documented is the Yahoo  
Webmap.

10,000 cores
500 TB shuffle
300 TB compressed final output

http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

-- Owen
