Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/DiskSetup

The comment on the change is:
new page on disks for Hadoop.

New page:
= Setting up Disks for Hadoop =

Here are some recommendations for setting up disks in a Hadoop cluster. What we 
have here is anecdotal; hard evidence is very welcome, and everyone should 
expect a bit of trial and error.

== Key Points ==

The goals for a Hadoop cluster are normally massive amounts of data with high 
I/O bandwidth. Your MapReduce jobs may be I/O bound or CPU/memory bound; if you 
know which one matters more (effectively, how many CPU cycles and MB of RAM are 
used per Map or Reduce), you can make better decisions.

== Hardware ==

You don't need RAID disk controllers for Hadoop, as it copies data across 
multiple machines instead. This increases the likelihood that there is a free 
task slot near that data, and if the servers are on different PSUs and 
switches, it eliminates some more points of failure in the datacenter.

Having lots of disks per server gives you more raw IO bandwidth than having one 
or two big disks. If you have enough disks that different tasks can use 
different disks for input and output, disk seeking is minimized, and seeking is 
one of the big disk performance killers. That said, more disks have a higher 
power budget; if you are power limited, you may want fewer but larger disks.

== Configuring Hadoop ==

Pass a list of directories (one per disk) to the dfs.data.dir parameter; Hadoop 
will use all of the disks that are available.
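
As a sketch, in hadoop-site.xml the value is a comma-separated list of 
directories; the mount points below are hypothetical examples, not a 
recommended layout:

```xml
<property>
  <name>dfs.data.dir</name>
  <!-- one directory per physical disk; paths are examples only -->
  <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data</value>
</property>
```

The DataNode will then spread block storage across all the listed directories.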

== Underlying File System ==

=== Ext3 ===

It's widely believed that Yahoo! use ext3. Regardless of the merits of the 
filesystem, that means that HDFS-on-ext3 has been publicly tested at a bigger 
scale than any other underlying filesystem.

=== XFS ===

From Bryan on the core-user list on 19 May 2009:

 We use XFS for our data drives, and we've had somewhat mixed results. One of 
the biggest pros is that XFS has more free space than ext3, even with the 
reserved space settings turned all the way to 0. Another is that you can format 
a 1TB drive as XFS in about 0 seconds, versus minutes for ext3. This makes it 
really fast to kickstart our worker nodes.

 We have seen some weird stuff happen though when machines run out of memory, 
apparently because the XFS driver does something odd with kernel memory. When 
this happens, we end up having to do some fscking before we can get that node 
back online.

 As far as outright performance, I actually *did* do some tests of xfs vs ext3 
performance on our cluster. If you just look at a single machine's local disk 
speed, you can write and read noticeably faster when using XFS instead of ext3. 
However, the reality is that this extra disk performance won't have much of an 
effect on your overall job completion performance, since you will find yourself 
network bottlenecked well in advance of even ext3's performance.

 The long and short of it is that we use XFS to speed up our new machine 
deployment, and that's it.
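
For reference, the formatting and reserved-space operations Bryan alludes to 
are roughly the following. This is a sketch with placeholder device names, not 
a tested recipe:

```shell
# Format a data drive as XFS (near-instant, even on a 1TB drive)
mkfs.xfs -f /dev/sdb1

# Or format it as ext3 instead (takes minutes on a 1TB drive)
mkfs.ext3 /dev/sdb1

# For ext3, reclaim the reserved space: set reserved blocks to 0%
tune2fs -m 0 /dev/sdb1
```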
