Hello, I am testing a freshly installed OSCAR 2.1 cluster. The cluster comprises 64 dual-Xeon nodes. OSCAR was loaded atop Red Hat 7.3, updated to current levels as of Mon 1/13; the kernel is 2.4.18-19.7.xsmp. The interconnect is gigabit Ethernet: the motherboards have the Intel 82544 chip onboard, and the switch is a Foundry FastIron 1500. OSCAR installed without incident using the network/PXE boot method.
The problem I am seeing is that during MPICH jobs the job will freeze before completion, waiting for a number of nodes to finish. Eventually (3-4 minutes) mpirun kicks out the error message about p4 and buggy rsh programs. While the error is occurring I see *very* high network I/O on a small number of nodes, and they are the same nodes that have failed to return a result. It is always an even number of nodes doing this "ping-pong". I ran tcpdump on some of the nodes during this behavior, via an out-of-band serial console. Here is an excerpt of the tcpdump output:

--------------start tcpdump output--------------
13:43:19.673406 vn24.cluster.44260 > vn32.cluster.32802: S 2483550144:2483550144(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673493 vn32.cluster.32802 > vn24.cluster.44260: R 0:0(0) ack 2483550145 win 0 (DF)
13:43:19.673538 vn24.cluster.44261 > vn32.cluster.32802: S 2482105596:2482105596(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673590 vn32.cluster.32802 > vn24.cluster.44261: R 0:0(0) ack 2482105597 win 0 (DF)
13:43:19.673640 vn24.cluster.44262 > vn32.cluster.32802: S 2488148110:2488148110(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673693 vn32.cluster.32802 > vn24.cluster.44262: R 0:0(0) ack 2488148111 win 0 (DF)
13:43:19.673743 vn24.cluster.44263 > vn32.cluster.32802: S 2473501000:2473501000(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673800 vn32.cluster.32802 > vn24.cluster.44263: R 0:0(0) ack 2473501001 win 0 (DF)
-------------end tcpdump output---------------

The other nodes caught in this looping or ping-pong behavior show the same kind of tcpdump output. In other words, vn24 opens connection after connection to port 32802 on vn32, and every SYN is answered immediately with an RST; since an RST in reply to a SYN normally means nothing is listening on that port, it looks as though the p4 process vn24 is trying to reach on vn32 is already gone (or never started), and vn24 just keeps retrying. It does this over and over until mpirun finally errors out:

------mpirun error msg------
Timeout in waiting for processes to exit, 2 left.  This may be due to a
defective rsh program (Some versions of Kerberos rsh have been observed
to have this problem).  This is not a problem with P4 or MPICH but a
problem with the operating environment.  For many applications, this
problem will only slow down process termination.
All operations completed.
p23_1996: p4_error: Timeout in establishing connection to remote process: 0
p22_2036: p4_error: Timeout in establishing connection to remote process: 0
-----------------------------

I haven't seen this before, so I am hoping someone else has. I have gone over the hardware pretty thoroughly and found no problems. The Foundry switch shows no tx/rx or any other errors during these occurrences, or at any time for that matter; if it were the switch or cabling, there would be errors elsewhere (NFS, rsync, etc.), and there aren't any. The locations of the errors are sporadic as well: there is never a node consistently involved when the error conditions occur.
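For reference, here is roughly what I have been running on the affected nodes while the loop is going on. This is only a sketch: the interface (eth0), the peer (vn32), and the port (32802) are taken from the trace above, and "myapp" stands in for the actual MPI binary name, so adjust these for your own setup.

------diagnostic commands (sketch)------
# Show only SYN and RST segments involving the looping peer
# (in the TCP flags byte, 0x02 = SYN and 0x04 = RST):
tcpdump -n -i eth0 'host vn32 and tcp[13] & 0x06 != 0'

# On the node sending the RSTs, see whether anything is still
# listening on the port the SYNs are aimed at; no listener would
# explain the immediate RSTs:
netstat -tlnp | grep 32802

# Check whether the p4 processes for the job are still alive
# (substitute the real MPI application name for "myapp"):
ps ax | grep myapp
----------------------------------------

Since the mpirun message points a finger at rsh, one experiment I may try is telling ch_p4 to use ssh instead, by setting the P4_RSHCOMMAND environment variable (export P4_RSHCOMMAND=ssh) before invoking mpirun, just to rule the remote shell in or out.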
I appreciate anyone who has taken the time to read this, and I am grateful for any ideas or suggestions you may have.

Jeff
