This is probably a Linux e1000 driver problem, but I figured I'd ask on this list whether anyone else has seen it.

We run distcc in a server cluster. Running a large distcc compilation on 4 to 8 cluster nodes via rsh causes about half of the nodes running distccd to crash, seemingly at random. The entire system goes down: the crashed nodes do not respond to ping, etc. Sometimes the e1000 network driver module fails to start upon reboot. A second reboot always brings the system and the e1000 interface up correctly. The same distcc and OS versions work fine on desktop systems that use Fast Ethernet and other network drivers.

Each node is a dual Pentium 4 Xeon with an Intel 82544GC Gigabit Ethernet Controller (rev 02) integrated on the motherboard. The OS is Fedora Core 1 (kernel 2.4.22, gcc 3.3.2, glibc 2.3.2). The Fedora kernel package is the latest update for FC1: kernel-smp-2.4.22-1.2197.nptl.

We reproduced the crashes several times with distcc versions 2.11 and 2.16 and with e1000 driver versions 5.1.13-k1 (included with the Fedora kernel) and 5.3.19 (the latest).

Otherwise the e1000 driver, even the older version included in the Fedora kernel, has worked without problems on this hardware. Distcc seems to be the only application that triggers a crash (if the problem really is e1000, that is). I have not tried running a uniprocessor kernel, so I guess the problem might be specific to SMP.

Has anyone experienced anything like this?

Mikko

__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options:
http://lists.samba.org/mailman/listinfo/distcc
