I just wanted to add one other published benchmark to this thread: http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html In this example, on a very busy cluster of 4000 nodes, both read and write throughput were close to the local disk bandwidth. The benchmark (called TestDFSIO) uses large sequential writes and reads. You can run it yourself on your hardware to compare.
Is it more efficient to unify the disks into one volume (RAID or LVM) and then present them as a single space? Or is it better to specify each disk separately?
There was a discussion recently on this list about RAID0 vs separate disks. Please search the archives. Separate disks turn out to perform better.
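For reference, "separate disks" here means listing one directory per physical disk in the DataNode configuration. A minimal sketch, assuming the disks are mounted at /disk1, /disk2 and /disk3 (hypothetical paths, adjust to your layout) and that your release reads dfs.data.dir from hadoop-site.xml; newer releases use dfs.datanode.data.dir in hdfs-site.xml instead:

```xml
<!-- Sketch only: mount points below are assumed, not taken from a real cluster. -->
<property>
  <name>dfs.data.dir</name>
  <!-- Comma-separated list, one directory per physical disk. -->
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
```

The DataNode spreads new blocks across the listed directories, so each disk is read and written independently rather than through a striping layer.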
Reliability-wise, the latter sounds more correct, since one or several (up to 3) disks going down won't take the whole node with them. But is there perhaps a performance penalty?
You always have block replicas on other nodes, so one node going down should not be a problem.

Thanks,
--Konstantin
