I have not watched large Hadoop clusters closely, but in my experience with
other large clusters under heavy, seek-dominated disk loads, the behavior
you describe is consistent. Some disks do become very slow, and if they are
part of a RAID array, the whole array runs at the speed of the slowest disk.
Running iostat -x helps confirm this.
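
As a rough illustration of the iostat check, here is a minimal Python sketch that reads captured `iostat -x` text and flags devices whose average wait time is far above the median of their peers. The function names, the default `await` column, and the 3x threshold are my own assumptions for illustration. Column names differ between sysstat versions (newer ones report `r_await` and `w_await` instead), so check the header on your machines first.

```python
import statistics


def parse_iostat_devices(text):
    """Parse `iostat -x` text into {device: {column_name: value}}.

    Columns are looked up by the names in the "Device" header row, so the
    parser does not depend on a fixed column order. If the text holds
    several reports, the last one wins for each device.
    """
    rows = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields[1:]
            continue
        if header is not None and len(fields) == len(header) + 1:
            try:
                rows[fields[0]] = dict(zip(header, map(float, fields[1:])))
            except ValueError:
                pass  # not a numeric data row, skip it
    return rows


def slow_disks(rows, column="await", factor=3.0):
    """Return devices whose `column` value exceeds `factor` times the median.

    `column` and `factor` are illustrative defaults, not tuned values.
    """
    values = {dev: cols[column] for dev, cols in rows.items() if column in cols}
    if len(values) < 2:
        return []  # nothing to compare against
    median = statistics.median(values.values())
    return sorted(dev for dev, v in values.items() if v > factor * median)
```

To use it, capture something like `iostat -x 5 2` on a node and pass the text to `parse_iostat_devices`. The first report from iostat is an average since boot, so the later interval is the one worth looking at, which is why the parser keeps the last report it sees.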
Also, comparing ext2 and ext3, ext3 did not show a noticeable slowdown.
Application access patterns often dictate disk performance more than the
native filesystem implementation does. The filesystem would probably
matter more when dealing with lots of small files.
Dennis Kubes wrote:
Can anyone who is running large clusters (50+) tell me what you are
seeing with hard disk failure rates? Something that we are seeing is
that certain machines will consistently have double or triple the load
of other machines with the same tasks. I believe that it is due to some
hard disks beginning to fail. I just wanted to know if anyone else is
seeing similar behavior.
Dennis Kubes