We are having a lot of problems after upgrading a
250-node cluster to Red Hat 7.2.
We have 5 racks of 50 nodes each, all plugged into
Extreme switches. The nodes have onboard NICs using the EEPRO100 driver (the
NICs are i82557/i82558).
We are using kernel 2.4.13, and must, since it is
the only kernel our clustering software (MOSIX) supports.
The switches are configured properly and allow all
protocols and multicast. Here are some of the errors we are getting when moving
large amounts of data (by large I mean many small files of about 50K each; we
are using the cluster to do image analysis):
1) ifconfig reports large numbers of collisions:

eth0      Link encap:Ethernet  HWaddr 00:E0:81:01:80:C0
          inet addr:10.0.0.2  Bcast:10.0.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28335076 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5993190 errors:0 dropped:0 overruns:0 carrier:9130
          collisions:39577
          RX bytes:2376002511 (2265.9 Mb)  TX bytes:646896700 (616.9 Mb)

2) We are getting messages like these on some nodes:

23443(remote): Arrival rejected due to severe memory shortage.
23449(remote): Arrival rejected due to severe memory shortage.
eth0: card reports no resources.
eth0: card reports no resources.

I am assuming there is no faulty hardware, since we
are talking about 5 switches and 250 nodes, all with similar
problems.
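
To put the eth0 counters in proportion, here is a rough sketch that turns them into per-packet rates. It is only an illustration: the function name tx_error_ratios is made up for this post, and the text is pasted from the ifconfig output above rather than read from a live interface. With these numbers the collision count works out to roughly 0.66% of transmitted packets.

```python
import re

# Counters copied from the eth0 output quoted in this post (not read live).
IFCONFIG_TEXT = """
RX packets:28335076 errors:0 dropped:0 overruns:0 frame:0
TX packets:5993190 errors:0 dropped:0 overruns:0 carrier:9130
collisions:39577
"""


def tx_error_ratios(text):
    """Return (collision_ratio, carrier_ratio), both relative to TX packets.

    Hypothetical helper for illustration; it assumes the old net-tools
    ifconfig layout shown above ("TX packets:N ... carrier:N", "collisions:N").
    """
    tx = int(re.search(r"TX packets:(\d+)", text).group(1))
    carrier = int(re.search(r"carrier:(\d+)", text).group(1))
    collisions = int(re.search(r"collisions:(\d+)", text).group(1))
    return collisions / tx, carrier / tx


if __name__ == "__main__":
    coll, carr = tx_error_ratios(IFCONFIG_TEXT)
    print(f"collisions per TX packet:     {coll:.4%}")
    print(f"carrier errors per TX packet: {carr:.4%}")
```

Running the same calculation on output from several nodes would show whether the rates are similar across racks.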
Thanks for any help,
Chuck