Hi,

I'm commissioning a new Hadoop cluster with the following spec.

45 x data nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 16GB RAM
- 4 x WDC WD1002FBYS 1TB SATA drives (configured as separate ext4 filesystems)

3 x name nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 32GB RAM
- 2 x WDC WD1002FBYS 1TB SATA drives (in software RAID1 config and ext4 filesystem)

All nodes are running Debian testing/squeeze.

I'm benchmarking with TeraSort, run as follows:

hadoop jar hadoop-0.20.2-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /terasort/in

hadoop jar hadoop-0.20.2-examples.jar terasort -Dmapred.reduce.tasks=530 /terasort/in /terasort/out

When I run this on the Debian 2.6.30 kernel, it runs to completion in about 23 minutes (occasionally running into the CPU soft lockup problems described in [1]). Is that a reasonable time for this benchmark?
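
For context, here is a rough back-of-envelope sketch of what that run amounts to. It assumes TeraGen's standard 100-byte rows, so the 10000000000 rows above come to about 1TB. The function name and its parameters are only illustrative, and the figures describe the rate at which sorted data is produced, not actual disk or network I/O (the data is also read, spilled and shuffled along the way).

```python
# Back-of-envelope throughput for the TeraSort run described above.
# Assumption: TeraGen writes 100-byte rows; everything else is from the spec.

ROW_BYTES = 100          # assumed TeraGen row size in bytes
ROWS = 10_000_000_000    # row count passed to teragen
DATA_NODES = 45          # data nodes in the cluster
DISKS_PER_NODE = 4       # data drives per data node
RUNTIME_MIN = 23         # approximate wall-clock time on 2.6.30


def terasort_summary(rows=ROWS, row_bytes=ROW_BYTES, nodes=DATA_NODES,
                     disks=DISKS_PER_NODE, runtime_min=RUNTIME_MIN):
    """Return dataset size and sorted-output rates in MB/s (1 MB = 10**6 bytes)."""
    total_bytes = rows * row_bytes
    seconds = runtime_min * 60
    aggregate = total_bytes / seconds / 1e6   # whole cluster
    per_node = aggregate / nodes              # averaged over data nodes
    per_disk = per_node / disks               # averaged over data drives
    return {
        "total_bytes": total_bytes,
        "aggregate_mb_s": aggregate,
        "per_node_mb_s": per_node,
        "per_disk_mb_s": per_disk,
    }


if __name__ == "__main__":
    for name, value in terasort_summary().items():
        print(f"{name}: {value:,.1f}")
```

Under those assumptions the cluster sorts roughly 725 MB/s in aggregate, or about 16 MB/s per data node.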

When I run this on the Debian 2.6.32 kernel, 1 or 2 datanodes stop responding to network traffic at some point during the run.

Logging into these nodes via the console reveals no messages in the log files. Running ifdown eth0 followed by ifup eth0 brings these systems back online. The systems that become unresponsive vary from run to run, which suggests this is not a hardware problem specific to certain nodes.

I have raised this issue with the Debian kernel team[2] and have tested
various system and switch changes in an attempt to identify the cause -
but without success.

Has anyone run into similar problems in their environments? I noticed that when the nodes become unresponsive, it often happens when the TeraSort is at

map 100%, reduce 78%

Is there any significance to that?

Any feedback welcome (including comments on what distro/kernel combinations others are using).

Thanks,

-stephen

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=572201

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com
