Last week I installed a Gigabit switch and some 3Com 3c2000-T gigabit network cards 
on my little cluster, which has an existing OSCAR 3.0 and RH9.0 setup that previously 
worked.  I used the drivers that came on CD with the cards and got internet 
connectivity through the card on my head node, as long as I turned off the other 
interface.  So I was happy, plugged back into my old onboard ethernet cards and went 
back to work.  Everything looked fine, and I successfully ran some LAM/MPI programs I 
was working on.

Today I installed NetPIPE to see what effect the switch had on my network 
performance and was surprised that it crashed when I tried to run it on more than one 
node.  So I checked Ganglia, only to find that it was reporting all my nodes as down 
despite the fact that I had just successfully lambooted them.  I ran the program on 
the two processors on my head node without any problem using the same lamboot file.  I 
had shut down my cluster over the weekend, so I checked lsmod to see if the gigabit 
module was still loaded.  It was not, so I insmod'ed the gigabit module back in on 
the head and all compute nodes.  Ganglia can now see all the nodes again, but 
NetPIPE still gets the same error.
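
For reference, the lsmod check I describe above could be scripted roughly as follows. This is only a minimal sketch: the module name `3c2000` is an assumption (use whatever name lsmod actually reports for the 3Com driver), and `module_loaded` and `check_local` are hypothetical helper names made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a kernel module shows up in lsmod-style output.
# ASSUMPTION: the 3Com driver module is called "3c2000"; substitute the
# name that lsmod prints on your own system.

# module_loaded NAME
# Reads lsmod-format text on stdin and succeeds (exit 0) only if NAME
# matches the first column of some line exactly.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# check_local NAME
# Prints a one-line status for the host this runs on.
check_local() {
    if lsmod | module_loaded "$1"; then
        echo "$(hostname): $1 loaded"
    else
        echo "$(hostname): $1 NOT loaded"
    fi
}

# Example use (commented out so that sourcing this file has no side effects):
#   check_local 3c2000
#   for n in oscarnode001 oscarnode002; do
#       ssh "$n" lsmod | module_loaded 3c2000 || echo "$n: 3c2000 NOT loaded"
#   done
```

Matching on the first column exactly, rather than grepping the whole line, avoids a false positive when the name only appears in another module's "Used by" list.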

Anyway, my question is: would installing this gigabit card cause this type of 
unexpected system behavior, and if so, how might I fix it?  From what I have seen, 
NetPIPE should be totally mindless to install and run, but has anyone used it with 
OSCAR, successfully or unsuccessfully?

Thanks for any help y'all can give.

**Error Message dump from NPmpi**
[]$mpirun -np 16 NPmpi
0: oscarhost
4: oscarnode002
2: oscarnode001
8: oscarnode004
6: oscarnode003
3: oscarnode001
10: oscarnode005
12: oscarnode006
5: oscarnode002
9: oscarnode004
7: oscarnode003
14: oscarnode007
11: oscarnode005
15: oscarnode007
13: oscarnode006
1: oscarhost
Now starting the main loop
  0:       1 bytes    951 times --> MPI_Recv: process in local group is dead (rank 1, SSI:coll:smp:local comm for CID 0)
Rank (15, MPI_COMM_WORLD): Call stack within LAM:
Rank (15, MPI_COMM_WORLD):  - MPI_Recv()
Rank (15, MPI_COMM_WORLD):  - MPI_Bcast()
Rank (15, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (15, MPI_COMM_WORLD):  - main()
     0.09 Mbps in      83.57 usec
MPI_Recv: process in local group is dead (rank 0, SSI:coll:smp:coord comm for CID 0)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD):  - MPI_Recv()
Rank (0, MPI_COMM_WORLD):  - MPI_Gather()
Rank (0, MPI_COMM_WORLD):  - MPI_Barrier()
Rank (0, MPI_COMM_WORLD):  - main()


