Last week I installed a Gigabit switch and some 3Com 3c2000-T gigabit network cards
on my little cluster, which has an existing OSCAR 3.0 and RH9.0 setup that previously
worked. I used the drivers that came on the CD with the cards and got internet
connectivity through the card on my head node, as long as I turned off the other
interface. So I was happy, plugged back into my old onboard ethernet cards, and went
back to work. Everything looked fine and I successfully ran some LAM/MPI programs I was
working on.
Today I installed NetPIPE to see what effect the switch had on my network
performance and was surprised that it crashed whenever I tried to run it on more than
one node. So I checked Ganglia, only to find that it was reporting all my nodes as down
despite the fact that I had just successfully lambooted them. NetPIPE ran on
the two processors of my head node without any problem using the same lamboot file. I
had shut down my cluster over the weekend, so I checked lsmod to see whether the gigabit
module was still loaded. It was not, so I insmod'ed the gigabit module back in on
the head node and all compute nodes. Ganglia magically can see all the nodes again.
NetPIPE still gets the same error.
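
Since insmod only loads a module into the running kernel, the module has to be
checked again after every reboot. A minimal sketch of that check follows. The module
name 3c2000 is an assumption (use whatever name lsmod actually reports for the 3Com
driver), and module_loaded and module_status are hypothetical helper names, not part
of OSCAR or the driver package.

```shell
#!/bin/sh
# Hypothetical module name: replace with the name lsmod shows for the 3Com driver.
GIGE_MODULE="${GIGE_MODULE:-3c2000}"

# Succeed (exit status 0) if the named module appears in a module listing.
# The listing defaults to /proc/modules; the optional second argument exists
# only so the check can be tried against a saved copy of that file.
module_loaded() {
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Print a one-line status for the named module.
module_status() {
    if module_loaded "$1" "${2:-/proc/modules}"; then
        echo "$1: loaded"
    else
        echo "$1: MISSING"
    fi
}
```

Running `module_status "$GIGE_MODULE"` on the head node and on each compute node in
turn (for example over ssh) would show which nodes lost the module after the shutdown.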
Anyway, my question is: could installing this gigabit card cause this type of
unexpected system behavior and, if so, how might I fix it? From what I have seen,
NetPIPE should be totally mindless to install and run, but has anyone used it with
OSCAR successfully (or unsuccessfully)?
Thanks for any help y'all can give.
**Error Message dump from NPmpi**
[]$mpirun -np 16 NPmpi
0: oscarhost
4: oscarnode002
2: oscarnode001
8: oscarnode004
6: oscarnode003
3: oscarnode001
10: oscarnode005
12: oscarnode006
5: oscarnode002
9: oscarnode004
7: oscarnode003
14: oscarnode007
11: oscarnode005
15: oscarnode007
13: oscarnode006
1: oscarhost
Now starting the main loop
0: 1 bytes 951 times --> MPI_Recv: process in local group is dead (rank 1,
SSI:coll:smp:local comm for CID 0)
Rank (15, MPI_COMM_WORLD): Call stack within LAM:
Rank (15, MPI_COMM_WORLD): - MPI_Recv()
Rank (15, MPI_COMM_WORLD): - MPI_Bcast()
Rank (15, MPI_COMM_WORLD): - MPI_Barrier()
Rank (15, MPI_COMM_WORLD): - main()
0.09 Mbps in 83.57 usec
MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm for CID 0)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Gather()
Rank (0, MPI_COMM_WORLD): - MPI_Barrier()
Rank (0, MPI_COMM_WORLD): - main()
_______________________________________________
Oscar-users mailing list
https://lists.sourceforge.net/lists/listinfo/oscar-users