The errors you are seeing are typical of one process dying during the run and another process then trying to communicate with it. Specifically, it's happening during a 2-level collective (barrier), which is really odd -- that shouldn't happen. Could there have been a network glitch where a socket was closed and MPI therefore deduced that the remote process(es) were dead?
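If it helps to narrow down where the failure was noticed, here is a minimal sketch (plain Python, not part of LAM/MPI or NetPIPE) that maps each rank that printed a LAM call stack in the dump below back to the host it ran on. The function name, the regular expressions, and the shortened sample text are assumptions based only on the output format quoted in this thread.

```python
import re

# "N: hostname" lines printed by NPmpi at startup (format taken from the dump).
RANK_HOST = re.compile(r"^\s*(\d+):\s+(\S+)\s*$")
# Header LAM prints before a call stack; it may appear mid-line in mixed output.
CALL_STACK = re.compile(r"Rank \((\d+), MPI_COMM_WORLD\): Call stack within LAM")


def reporting_hosts(log_text):
    """Return {rank: host} for every rank that printed a LAM call stack."""
    hosts = {}
    reporters = []
    for line in log_text.splitlines():
        m = RANK_HOST.match(line)
        if m:
            hosts[int(m.group(1))] = m.group(2)
        for rank in CALL_STACK.findall(line):
            reporters.append(int(rank))
    # Ranks without a known host are reported as "unknown".
    return {rank: hosts.get(rank, "unknown") for rank in reporters}


# Shortened sample in the same shape as the dump in this thread.
SAMPLE = """\
0: oscarhost
15: oscarnode007
1: oscarhost
Now starting the main loop
MPI_Recv: process in local group is dead (rank 1, SSI:coll:smp:local comm for CID 0) Rank (15, MPI_COMM_WORLD): Call stack within LAM:
Rank (15, MPI_COMM_WORLD): - MPI_Recv()
MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm for CID 0) Rank (0, MPI_COMM_WORLD): Call stack within LAM:
"""

if __name__ == "__main__":
    for rank, host in reporting_hosts(SAMPLE).items():
        print(f"rank {rank} on {host} reported a dead peer")
```

This only tells you which ranks complained, not which process actually died first, so treat the result as a starting point for checking those nodes.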
On Mon, 28 Jun 2004, Michael Edwards wrote:
Last week I installed a Gigabit switch and some 3Com 3c2000-T gigabit network cards on my little cluster, which has an existing OSCAR 3.0 and RH9.0 setup that previously worked. I used the drivers that came on the CD with the cards and got internet connectivity with the card on my head node as long as I turned off the other interface. So I was happy, plugged back into my old onboard ethernet cards, and went back to work. Everything looked fine and I successfully ran some LAM/MPI programs I was working on.
Today I installed NetPIPE in an effort to see what effect the switch had on my network performance and was surprised that it crashed if I tried to run it on more than one node. So I checked Ganglia, only to find that it was reading all my nodes as down despite the fact that I had just successfully lambooted them. I ran the program on the two processors on my head node without any problem using the same lamboot file. I had shut down my cluster over the weekend, so I checked lsmod to see if the gigabit module was still loaded. It was not, so I tried insmoding the gigabit module back in on the head and all compute nodes. Ganglia magically can see all the nodes again. NetPIPE still gets the same error.
Anyway, my question is: would installing this gigabit card cause this type of unexpected system behavior, and if so, how might I fix it? From what I have seen, NetPIPE should be totally mindless to install and run, but has anyone used it with OSCAR successfully (or unsuccessfully)?
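The lsmod check described above could be scripted roughly as follows. This is only a sketch: `module_loaded` is a made-up helper, `gigabit_drv` is a placeholder rather than the real driver's module name, and the sample text is invented to show the usual `lsmod` column layout. It assumes the `lsmod` output from each node has already been captured as text.

```python
def module_loaded(lsmod_output, module_name):
    """Return True if module_name appears in the first column of lsmod output."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first field of each row is the module name; the header row's
        # first field is the literal word "Module", so it never matches a
        # real module name.
        if fields and fields[0] == module_name:
            return True
    return False


# Invented sample in the usual "Module / Size / Used by" layout.
SAMPLE_LSMOD = """\
Module                  Size  Used by
gigabit_drv           123456  1
nfs                    81336  2
"""

if __name__ == "__main__":
    # "gigabit_drv" is a placeholder; substitute the real module name.
    print(module_loaded(SAMPLE_LSMOD, "gigabit_drv"))
```

Running the same check on the head node and every compute node after a reboot would show whether the module needs to be loaded again at boot time.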
Thanks for any help y'all can give.
**Error Message dump from NPmpi**
[]$mpirun -np 16 NPmpi
0: oscarhost
4: oscarnode002
2: oscarnode001
8: oscarnode004
6: oscarnode003
3: oscarnode001
10: oscarnode005
12: oscarnode006
5: oscarnode002
9: oscarnode004
7: oscarnode003
14: oscarnode007
11: oscarnode005
15: oscarnode007
13: oscarnode006
1: oscarhost
Now starting the main loop
0: 1 bytes 951 times -->
MPI_Recv: process in local group is dead (rank 1, SSI:coll:smp:local comm for CID 0)
Rank (15, MPI_COMM_WORLD): Call stack within LAM:
Rank (15, MPI_COMM_WORLD): - MPI_Recv()
Rank (15, MPI_COMM_WORLD): - MPI_Bcast()
Rank (15, MPI_COMM_WORLD): - MPI_Barrier()
Rank (15, MPI_COMM_WORLD): - main()
0.09 Mbps in 83.57 usec
MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm for CID 0)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Gather()
Rank (0, MPI_COMM_WORLD): - MPI_Barrier()
Rank (0, MPI_COMM_WORLD): - main()
-------------------------------------------------------
This SF.Net email sponsored by Black Hat Briefings & Training.
Attend Black Hat Briefings & Training, Las Vegas July 24-29 - digital self defense, top technical experts, no vendor pitches, unmatched networking opportunities. Visit www.blackhat.com
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
-- {+} Jeff Squyres {+} [EMAIL PROTECTED] {+} http://www.lam-mpi.org/
