Todd Lipcon wrote:
Yes, it looks like it is a kernel bug alright (see thread on kernel netdev
at http://marc.info/?t=127094288900001&r=1&w=2 if interested). To be fair,
I don't think these bugs are confined to Debian - I did some initial testing
with Scientific Linux and also ran into problems with forcedeth.
Interesting, good find. I try to avoid forcedeth now and have heard the same
from ops people at various large Linux deployments. Not sure why, but it's
traditionally had a lot of bugs/regressions.
FYI, the netdev guys have proposed a patch and initial testing indicates
it fixes the problem (and brings the TeraSort down to about 18 minutes,
so win-win) :)
I share similar feelings about forcedeth, particularly after this, but
then I'm also dubious about at least some Broadcom chipsets, and even
Intel have had their issues
(https://bugzilla.kernel.org/show_bug.cgi?id=11382), so maybe it's just
that all NICs suck.
Finally, I figured burning in our cluster was a good opportunity to give
back to the community and do some testing on their behalf.
Very admirable of you :) It is good to have some people running new kernels
to suss these issues out before the rest of us check out modern technology
;-)
It also means there aren't problems lurking for us in the future when we
get forced to newer kernels for support/maintenance issues. I also ran
into http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030 while
testing a 2.6.30 kernel. It may be lurking in older kernels too (and
seems to have been fixed in 2.6.32), so there are perils both in staying
back and in going forward.
With regard to our TeraSort benchmark time of ~23 minutes - is that in the
right ballpark for a cluster of 45 data nodes plus a NameNode (nn) and a
secondary NameNode (2nn)?
Yep, sounds about the right ballpark.
Cool, thanks for the feedback. I'm surprised that others didn't comment
on the TeraSort result - perhaps others use something else for
smoke-testing/benchmarking their Hadoop clusters? If so, anyone want to
suggest what they do use? It'd be nice to see a collection of TeraSort
results somewhere, both to get an idea of which cluster configs work well
and to help people who want to sanity-check a new cluster.
-stephen
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com