Ah! That would be why it's not working with 16 then. Thought I had tried it with two processes on different machines, but must have either not done it right or not gotten around to it.
Since they have some other error checks, it might not be a bad idea to check for np>2, since that seems to make it crash with unhelpful error messages. Of course, it doesn't seem to be a widespread user assumption. "mpirun N -np 2 NPmpi" works just fine.

I was thinking it worked a bit differently than it actually does, and would do something a bit more like hpl, only with slowly increasing packet size. It makes sense to do it this way since it is trying to test network performance alone.

I am still a bit confused as to why Ganglia now won't work unless the gigabit module is loaded, but that is a curiosity question and not maddening like a simple program not working.

Thanks again for your patient help and for reading my rambling posts.

Original Message -----------------------

I thought that NetPIPE only used processes 0 and 1 for the ping pongs (no matter how many you ran)...? I could be wrong, though.

The errors that you are seeing are typical for when one process dies during the run and another process then tries to communicate with that process. Specifically, it's happening during a 2-level collective (barrier), which is really odd -- that shouldn't happen. Could there have been a network glitch where a socket was closed and MPI therefore deduced that the remote process(es) were dead?

On Mon, 28 Jun 2004, Michael Edwards wrote:

> Last week I installed a Gigabit switch and some 3c2000-T 3com gigabit
> network cards on my little cluster with an existing OSCAR 3.0 and RH9.0
> setup which previously worked. I used the drivers that came with the
> cards on CD and got internet connectivity with the card on my head node
> as long as I turned off the other interface. So I was happy, plugged
> back into my old onboard ethernet cards and went back to work.
> Everything looked fine and I successfully ran some LAM/MPI programs I
> was working on.
>
> Today I installed NetPIPE in an effort to see what effect the switch had
> on my network performance and was surprised that it crashed if I tried to
> run it on more than one node. So I checked Ganglia only to find that it
> was reading all my nodes as down despite the fact that I had just
> successfully lambooted them. I ran the program on the two processors on
> my head node without any problem using the same lamboot file. I had
> shut down my cluster over the weekend so I checked lsmod to see if the
> gigabit module was still in and it was not, so I tried insmoding the
> gigabit module back in on the head and all compute nodes. Ganglia
> magically can see all the nodes again. NetPIPE still gets the same
> error.
>
> Anyway, my question is: Would installing this gigabit card cause this
> type of unexpected system behavior and if so, how might I fix it? From
> what I have seen NetPIPE should be totally mindless to install and run,
> but has anyone used it with OSCAR successfully (or unsuccessfully)?
>
> Thanks for any help y'all can give.
>
> **Error Message dump from NPmpi**
> []$mpirun -np 16 NPmpi
> 0: oscarhost
> 4: oscarnode002
> 2: oscarnode001
> 8: oscarnode004
> 6: oscarnode003
> 3: oscarnode001
> 10: oscarnode005
> 12: oscarnode006
> 5: oscarnode002
> 9: oscarnode004
> 7: oscarnode003
> 14: oscarnode007
> 11: oscarnode005
> 15: oscarnode007
> 13: oscarnode006
> 1: oscarhost
> Now starting the main loop
> 0: 1 bytes 951 times --> MPI_Recv: process in local group is
> dead (rank 1, SSI:coll:smp:local comm for CID 0)
> Rank (15, MPI_COMM_WORLD): Call stack within LAM:
> Rank (15, MPI_COMM_WORLD): - MPI_Recv()
> Rank (15, MPI_COMM_WORLD): - MPI_Bcast()
> Rank (15, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (15, MPI_COMM_WORLD): - main()
> 0.09 Mbps in 83.57 usec
> MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord
> comm for CID 0)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - MPI_Gather()
> Rank (0, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (0, MPI_COMM_WORLD): - main()
>
> -------------------------------------------------------
> This SF.Net email sponsored by Black Hat Briefings & Training.
> Attend Black Hat Briefings & Training, Las Vegas July 24-29 -
> digital self defense, top technical experts, no vendor pitches,
> unmatched networking opportunities. Visit www.blackhat.com
> _______________________________________________
> Oscar-users mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
>

--
{+} Jeff Squyres
{+} [EMAIL PROTECTED]
{+} http://www.lam-mpi.org/
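
For what it's worth, the np>2 check suggested at the top of this thread could look roughly like the sketch below. It is written in Python only to show the logic and is not NetPIPE's actual code. `check_process_count` is a hypothetical helper, and rejecting every count other than 2 is an assumption. In a real MPI program the count would come from MPI_Comm_size on MPI_COMM_WORLD.

```python
def check_process_count(nprocs, required=2):
    """Return None if nprocs is usable, otherwise a readable error string.

    Hypothetical helper, not part of NetPIPE. In an MPI program the value
    of nprocs would come from MPI_Comm_size(MPI_COMM_WORLD, ...).
    Assumption: the ping-pong test wants exactly `required` processes.
    """
    if nprocs == required:
        return None
    return (
        f"this test needs exactly {required} processes, "
        f"but was started with {nprocs}; try 'mpirun -np {required} ...'"
    )


if __name__ == "__main__":
    # Try the two process counts mentioned in the thread.
    for n in (2, 16):
        problem = check_process_count(n)
        print(n, "ok" if problem is None else problem)
```

Doing the check before the first barrier would let every rank exit with one clear message instead of the "process in local group is dead" call stacks.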
