We are having a lot of problems after upgrading a 250-node cluster to Red Hat 7.2.
 
We have 5 racks of 50 nodes each, all plugged into Extreme switches. The nodes have onboard NICs (i82557/i82558) using the EEPRO100 driver.
We are using kernel 2.4.13, and must, since it is the only kernel our clustering software (MOSIX) supports.
 
The switches are configured properly and allow all protocols and multicast. Here are some of the errors we are getting when moving large amounts of data. By large I mean many small files of about 50K each; we are using the cluster to do image analysis.
 
1) ifconfig reports large numbers of collisions:
eth0      Link encap:Ethernet  HWaddr 00:E0:81:01:80:C0 
          inet addr:10.0.0.2  Bcast:10.0.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28335076 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5993190 errors:0 dropped:0 overruns:0 carrier:9130
          collisions:39577
          RX bytes:2376002511 (2265.9 Mb)  TX bytes:646896700 (616.9 Mb)
 
 
2) We are getting messages like these on some nodes:
23443(remote): Arrival rejected due to severe memory shortage.
23449(remote): Arrival rejected due to severe memory shortage.
eth0: card reports no resources.
eth0: card reports no resources.
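
To put the counters in item 1 in proportion, here is a minimal Python sketch that works out collisions and carrier errors as a fraction of transmitted packets. The function name tx_error_ratios is made up for illustration, and the parsing assumes the older ifconfig output format shown in this post; it is not a general-purpose parser.

```python
import re

# Counters copied from the eth0 output in item 1.
IFCONFIG_SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:E0:81:01:80:C0
          inet addr:10.0.0.2  Bcast:10.0.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28335076 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5993190 errors:0 dropped:0 overruns:0 carrier:9130
          collisions:39577
"""


def tx_error_ratios(text):
    """Return collisions and carrier errors as fractions of TX packets.

    Hypothetical helper: assumes the 'TX packets:', 'carrier:' and
    'collisions:' fields appear exactly as in the sample above.
    """
    tx = int(re.search(r"TX packets:(\d+)", text).group(1))
    carrier = int(re.search(r"carrier:(\d+)", text).group(1))
    collisions = int(re.search(r"collisions:(\d+)", text).group(1))
    return {
        "tx_packets": tx,
        "collision_ratio": collisions / tx,
        "carrier_ratio": carrier / tx,
    }


if __name__ == "__main__":
    ratios = tx_error_ratios(IFCONFIG_SAMPLE)
    print("collisions per TX packet: {:.2%}".format(ratios["collision_ratio"]))
    print("carrier errors per TX packet: {:.2%}".format(ratios["carrier_ratio"]))
```

For the sample above, collisions come to roughly 0.66% of transmitted packets and carrier errors to roughly 0.15%. Running the same calculation on several nodes would show whether the rate is uniform across the racks.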
 
I am assuming there is no faulty hardware, since we are talking about 5 switches and 250 nodes, all with similar problems.
 
Thanks for any help,
Chuck
 
