Hello,

I am testing a freshly installed Oscar 2.1 cluster.
The cluster comprises 64 dual-Xeon nodes. Oscar was
loaded atop Red Hat 7.3, which is updated to current
levels as of Mon 1/13; the kernel is 2.4.18-19.7.xsmp.
The interconnect is gigabit: the motherboards have an
onboard Intel 82544 chip, and the switch is a Foundry
FastIron 1500. Oscar installed without incident using
the network/PXE boot method.

The problem I am seeing is that during mpich jobs the
job will freeze before completion, waiting for a number
of nodes to complete. Eventually (3-4 minutes later)
mpirun kicks out the p4 error message about buggy RSH
programs.

When this error is occurring I see *very* high network
I/O on a small number of nodes -- the same nodes that
have failed to return a result. It is always an even
number of nodes doing this "ping-pong". I ran tcpdump
on some of the nodes during the exhibited behavior via
an out-of-band serial console. Here is an excerpt of
the tcpdump output:

--------------start tcpdump output--------------
13:43:19.673406 vn24.cluster.44260 > vn32.cluster.32802: S 2483550144:2483550144(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673493 vn32.cluster.32802 > vn24.cluster.44260: R 0:0(0) ack 2483550145 win 0 (DF)
13:43:19.673538 vn24.cluster.44261 > vn32.cluster.32802: S 2482105596:2482105596(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673590 vn32.cluster.32802 > vn24.cluster.44261: R 0:0(0) ack 2482105597 win 0 (DF)
13:43:19.673640 vn24.cluster.44262 > vn32.cluster.32802: S 2488148110:2488148110(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673693 vn32.cluster.32802 > vn24.cluster.44262: R 0:0(0) ack 2488148111 win 0 (DF)
13:43:19.673743 vn24.cluster.44263 > vn32.cluster.32802: S 2473501000:2473501000(0) win 5840 <mss 1460,sackOK,timestamp 3324624 0,nop,wscale 0> (DF)
13:43:19.673800 vn32.cluster.32802 > vn24.cluster.44263: R 0:0(0) ack 2473501001 win 0 (DF)
-------------end tcpdump output---------------
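For what it's worth, a SYN answered immediately by an RST, as in the trace above, is the classic signature of connecting to a port where nothing is listening -- which would fit a p4 listener on vn32 having already gone away while vn24 keeps retrying. A minimal Python sketch of the same refusal (node name and port here are just illustrative, taken from the trace):

```python
import socket

def probe(host, port, timeout=2.0):
    """Attempt a TCP connect and classify the result.

    Connecting to a closed port elicits exactly the SYN -> RST
    exchange seen in the tcpdump excerpt: the remote kernel answers
    the SYN with a reset and connect() fails with ECONNREFUSED.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "listening"
    except ConnectionRefusedError:
        return "refused (RST)"      # matches: R 0:0(0) ack ... win 0
    except socket.timeout:
        return "filtered/no answer" # SYN dropped, no RST came back
    finally:
        s.close()

# e.g. probe("vn32", 32802) from vn24 while the job is wedged
```

If the wedged nodes report "refused (RST)" for each other's p4 ports, the listener side has exited and the retry loop is what generates the high network I/O.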

The other nodes doing this looping or ping-pong
behavior show the same type of tcpdump output. It just
repeats over and over again until mpirun finally errors
out:

------mpirun error msg------
Timeout in waiting for processes to exit, 2 left. 
This may be due to a defective rsh program (Some
versions of Kerberos rsh have been observed to have
this problem).
This is not a problem with P4 or MPICH but a problem
with the operating environment.  For many
applications, this problem will only slow down process
termination.

All operations completed. 
p23_1996:  p4_error: Timeout in establishing
connection to remote process: 0
p22_2036:  p4_error: Timeout in establishing
connection to remote process: 0
-----------------------------

I haven't seen this before, so I am hoping someone else
has. I have gone over the hardware pretty thoroughly
and found no problems. The Foundry switch shows no
tx/rx or any other errors during these occurrences, or
at any time for that matter. If it were the switch or
cabling there would be errors in other places (NFS,
rsync, etc.), and there aren't any.

The locations of the errors are sporadic as well: no
node is consistently involved when the error conditions
occur.

I appreciate anyone who has taken the time to read
this, and I am grateful for any ideas or suggestions
you may have.

Jeff

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
