I'm seeing an odd storage performance problem that I hope can be fixed with the 
right configuration parameter, but nothing I've tried so far helps.  These 
tests were done in a virtual machine running on ESX, but earlier tests on 
native RHEL showed something similar.

Common configuration:
7 nodes with 10 GbE interconnect.
Each node: 2-socket Westmere, 96 GB RAM, 10 local SATA disks exported to the 
VM as JBODs, and a single 92 GB VM.
TestDFSIO: 140 files of 7143 MB each (about 1 TB total data), so 2 map tasks 
per disk; replication=2.  (Invocation sketched below.)
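
For reference, the write phase is kicked off roughly like this (the test jar 
name and path are from memory and may differ in your CDH3 layout):

    hadoop jar /usr/lib/hadoop/hadoop-test.jar TestDFSIO \
        -write -nrFiles 140 -fileSize 7143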

Case A:  RHEL 5.5, EXT3 file system, write-through configured on the physical 
disk
Case B:  RHEL 6.1, EXT4 file system, write-back
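
By write-through vs. write-back I mean the drive-level write-cache setting, 
configured here on the physical disks beneath the VM.  On a bare SATA disk 
that's the sort of thing hdparm toggles (illustration only; the device name 
below is a placeholder):

    hdparm -W0 /dev/sdb    # drive write cache off  (write-through)
    hdparm -W1 /dev/sdb    # drive write cache on   (write-back)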

Testing with aio-stress shows that the changes made in Case B all improve 
efficiency and performance.  But the TestDFSIO write test on Hadoop (CDH3u0) 
got slower:

Case A:  580 seconds exec time
Case B:  740 seconds

I can improve Case B to 710 seconds by going back to EXT3, or by mounting 
EXT4 with min_batch_time=2000 (example below), so slowing down the file system 
improves Hadoop performance.
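
Concretely, that's just a mount option on each data disk; the device and mount 
point here are placeholders, and the value is in microseconds if I'm reading 
the ext4 docs right:

    mount -o remount,min_batch_time=2000 /dev/sdb1 /data/1

(An equivalent entry in /etc/fstab behaves the same way.)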

Both cases show a peak write throughput of about 550 MB/s on each node.  The 
difference is that in Case A the throughput is steady and doesn't drop below 
500 MB/s, while in Case B it is very noisy, sometimes dropping all the way to 
0.  It is also sometimes periodic, rising and falling with a 15-30 second 
period, and that period is synchronized across all the nodes.  550 MB/s 
appears to be a controller limit; each disk alone is capable of 130 MB/s with 
a raw partition or EXT4 (about 100 MB/s with EXT3).  I tried replication=1 to 
eliminate nearly all networking, but storage throughput was still not steady.

My current theory is that the faster storage somehow confuses the scheduler, 
but I don't see what the mechanism would be.  Any ideas about what's going on, 
or things to try?  I don't want to have to recommend de-tuning storage just to 
get Hadoop to behave.

Thanks for the help,

Jeff
