Thanks for the heads-up, Owen. Do you know how long it took to run the application? And how many files were processed? I am particularly eager to know the answer to the second question.
I found an article at http://developer.yahoo.net/blogs/hadoop/2008/09/ in which the total number of cores used was over 30,000 and the number of files in the benchmark run was 14,000. The reported average read throughput was 18 MB/s on 500 nodes and 66 MB/s on 4000 nodes. The article explains (underneath Table 1): "The 4000-node cluster throughput was 7 times better than 500's for writes and 3.6 times better for reads even though the bigger cluster carried more (4 v/s 2 tasks) per node load than the smaller one."

Is the 66 MB/s figure the aggregate read throughput or the per-node throughput? If it were per-node, the aggregate bandwidth would have been 4000 x 66 MB/s = 264 GB/s, and the speedup of 4000 nodes over 500 nodes would have been (66/18) x (4000/500) ≈ 29.3, far more than the 3.6x quoted above. (A quick sanity check of that arithmetic follows after the quoted message below.)

-----Original Message-----
From: Owen O'Malley [mailto:[email protected]]
Sent: Friday, April 03, 2009 5:20 PM
To: [email protected]
Subject: Re: Hadoop/HDFS for scientific simulation output data analysis

On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote:

> By the way, what is the largest size---in terms of total bytes, number
> of files, and number of nodes---in your applications? Thanks.

The largest Hadoop application that has been documented is the Yahoo Webmap.

10,000 cores
500 TB shuffle
300 TB compressed final output

http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

-- Owen
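
P.S. Here is the sanity check referred to above, as a minimal Python sketch. The node counts and the 18 MB/s / 66 MB/s figures come from the article; treating them as per-node throughputs is the assumption being questioned, not an established fact.

    # Back-of-the-envelope check, assuming 18 MB/s and 66 MB/s are
    # PER-NODE read throughputs (the interpretation in question).
    nodes_small, nodes_big = 500, 4000
    per_node_small, per_node_big = 18, 66      # MB/s, from the article

    aggregate_big = nodes_big * per_node_big   # implied aggregate bandwidth, MB/s
    speedup = (per_node_big / per_node_small) * (nodes_big / nodes_small)

    print(f"implied aggregate read bandwidth: {aggregate_big / 1000:.0f} GB/s")  # ~264 GB/s
    print(f"implied speedup over {nodes_small} nodes: {speedup:.1f}x")           # ~29.3x, not 3.6x
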
