Todd Lipcon wrote:
> Most likely a kernel bug. In previous versions of Debian there was a buggy
> forcedeth driver, for example, that caused it to drop off the network in
> high load. Who knows what new bug is in 2.6.32 which is brand spanking new.
Yes, it looks like it is a kernel bug alright (see thread on kernel
netdev at http://marc.info/?t=127094288900001&r=1&w=2 if interested). To
be fair, I don't think these bugs are confined to Debian - I did some
initial testing with Scientific Linux and also ran into problems with
forcedeth.
> The overwhelming majority of production clusters run on RHEL 5.3 or RHEL 5.4
> in my experience (I'm lumping CentOS 5.3/5.4 in with RHEL here). I know one
> or two production clusters running Debian Lenny, but none running something
> as new as what you're talking about.
This is useful info - much appreciated. I guess if we don't manage to
stabilise the current config we'll look at moving to one of those.
> Hadoop doesn't exercise the new
> features in very recent kernels, so there's no sense accepting instability -
> just go with something old that works!
Sure, but I figured I'd go with a distro now that can be left largely
untouched for the next 2-3 years, and Debian Lenny felt a bit too old for
that. I know RHEL/CentOS would also fit that requirement; we'll see. I'm
also interested in using DRBD on some of our nodes for redundancy, and
again, running a newer distro should reduce the pain of configuring
that.
Finally, I figured burning in our cluster was a good opportunity to give
back to the community and do some testing on their behalf.
With regard to our TeraSort benchmark time of ~23 minutes: is that in
the right ballpark for a cluster of 45 data nodes plus a NameNode (nn)
and a secondary NameNode (2nn)?
Thanks,
-stephen
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com