Bernard Li wrote:
Hi Jim:
By default, OSCAR installs both LAM/MPI and MPICH, and both provide "mpirun". To switch between the two implementations, use the program called "switcher". You probably want to figure out which MPI implementation that particular user is actually using, whether it is MPICH or LAM/MPI. For instance:

[EMAIL PROTECTED] scripts]$ switcher mpi --show
system:default=lam-7.0.6
user:exists=1
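
For anyone following along, a short sketch of the switcher workflow (commands per OSCAR's env-switcher; the tag names are just the ones seen in this thread, check `switcher mpi --list` on your own cluster):

====
# list the MPI implementations registered with switcher
$ switcher mpi --list
lam-7.0.6
mpich-ch_p4-gcc-1.2.7

# show the effective setting for the current user
$ switcher mpi --show
system:default=lam-7.0.6
user:exists=1

# switch this user to MPICH; only takes effect in future shells,
# so log out and back in afterwards
$ switcher mpi = mpich-ch_p4-gcc-1.2.7
====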


SUCCESS! Apologies for the lengthy email here, but I think the info is pertinent/useful.

ORIGINAL PROBLEM:
I initially reported this as a non-local (i.e. LDAP-authenticated) user being able to run code with mpirun successfully but not via PBS/qsub. It quickly evolved into a local user also being unable to run code via PBS/qsub while still being able to use mpirun successfully.

INSTALLATION:
A small (poor man's) teaching cluster: 16 compute nodes plus 1 head node, all PIII. Basic interconnect is a 100 Mbit Ethernet switch. OS is unpatched RHEL4. Using external authentication and autofs/auto.master for user homes. PBS is configured so that only mom runs on the compute nodes.
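
For reference, a minimal sketch of how that mom-only-on-compute-nodes layout typically looks in the PBS/Torque nodes file (the path and np counts here are assumptions, not copied from my install; the node names match this cluster's naming):

====
# server_priv/nodes on the head node (e.g. /var/spool/pbs/server_priv/nodes)
# Only the compute nodes are listed, so pbs_mom runs only on them;
# the head node runs pbs_server and the scheduler but takes no jobs.
node1.oscardomain np=1
node2.oscardomain np=1
...
node16.oscardomain np=1
====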

SOLUTION:
The MPICH/PBS combination installed by OSCAR has some sort of issue on my installation; the root cause is still unresolved. Building a current MPICH by hand and switching users to it works around the problem.

STEPS TAKEN:
- downloaded the current stable release of MPICH (1.2.7p1)
- compiled and installed with:
export RSHCOMMAND=ssh
./configure --with-device=ch_p4 \
            --with-mpe \
            --prefix=/opt/mpich-1.2.7p1
make
make install
- added a file to the prefix so that switcher can manage this build, and
  registered it in the switcher DB (a sketch of this step follows the
  session log below)
- pushed the homegrown version to the nodes and ran switcher on all of them
- then, as a regular user, switched to the homegrown MPICH and was able to
  run the code via PBS/qsub
- logged in as a second LDAP user to verify. Log of the session follows:
===========================
-bash-3.00$ switcher mpi --show
system:default=mpich-ch_p4-gcc-1.2.7
user:exists=1
-bash-3.00$ mpicc -o node-test node-test.c
-bash-3.00$ cat testrun.pbs
#!/bin/sh
#
#PBS -l nodes=6
#
mpirun -np 6 /home/tmac2/clib/node-test
-bash-3.00$ qsub testrun.pbs
24.master
-bash-3.00$ cat testrun.pbs.o24
p0_28782: (60.473654) Procgroup:
p0_28782: (60.473807) entry 0: node15.oscardomain 0 0 /home/tmac2/clib/node-test tmac2
p0_28782: (60.473846) entry 1: rhel4.ehpctc.intern 1 1 /home/tmac2/clib/node-test tmac2
p0_28782: (60.473868) entry 2: rhel4.ehpctc.intern 1 2 /home/tmac2/clib/node-test tmac2
p0_28782: (60.473888) entry 3: rhel4.ehpctc.intern 1 3 /home/tmac2/clib/node-test tmac2
p0_28782: (60.473909) entry 4: rhel4.ehpctc.intern 1 4 /home/tmac2/clib/node-test tmac2
p0_28782: (60.473930) entry 5: rhel4.ehpctc.intern 1 5 /home/tmac2/clib/node-test tmac2
p0_28782: p4_error: Could not gethostbyname for host rhel4.ehpctc.intern; may be invalid name
: 61
-bash-3.00$ switcher mpi = mpich-1.2.7p1
Attribute successfully set; new attribute setting will be effective for
future shells

NOTE:  Here I logged out and back in.

-bash-3.00$ which mpicc
/opt/mpich-1.2.7p1/bin/mpicc
-bash-3.00$ mpicc -o node-test node-test.c
-bash-3.00$ qsub testrun.pbs
25.master
-bash-3.00$ cat testrun.pbs.o25
Message Received was: node1.oscardomain
Length of message=17
Status.Source=1
Status.Tag=50
Status.Error=0
Message Received was: node2.oscardomain
Length of message=17
Status.Source=2
Status.Tag=50
Status.Error=0
Message Received was: node3.oscardomain
Length of message=17
Status.Source=3
Status.Tag=50
Status.Error=0
Message Received was: node4.oscardomain
Length of message=17
Status.Source=4
Status.Tag=50
Status.Error=0
Message Received was: node5.oscardomain
Length of message=17
Status.Source=5
Status.Tag=50
Status.Error=0
===========================
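
For completeness, here is roughly what the earlier "added a file to the prefix and added it to the DB" step looks like. This is a hedged sketch, not the exact commands from my session; env-switcher is built on environment modules, and the option syntax should be verified with `switcher --help` on your own installation:

====
# register the hand-built MPICH under a new switcher tag
# (run as root, and repeat on / push to the compute nodes as well)
$ switcher mpi --add-name mpich-1.2.7p1 /opt/mpich-1.2.7p1

# confirm the new tag is in the DB
$ switcher mpi --list
====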

Next I reviewed the logs left from Step 8 (Testing) of the cluster install.
The mpich test:
====
[EMAIL PROTECTED] mpich]# cat mpichtest.out
Running MPICH test

--> MPI C bindings test:

Process 0 of 16 on node15.oscardomain
1000 iterations: pi is approx. 3.1415927370900438, error = 0.0000000835002507
wall clock time = 0.000117
Process 6 of 16 on node8.oscardomain
Process 14 of 16 on node16.oscardomain
Process 8 of 16 on node6.oscardomain
Process 7 of 16 on node7.oscardomain
Process 15 of 16 on node12.oscardomain
Process 11 of 16 on node3.oscardomain
Process 3 of 16 on node11.oscardomain
Process 2 of 16 on node13.oscardomain
Process 10 of 16 on node4.oscardomain
Process 1 of 16 on node14.oscardomain
Process 9 of 16 on node5.oscardomain
Process 4 of 16 on node10.oscardomain
Process 12 of 16 on node2.oscardomain
Process 13 of 16 on node1.oscardomain
Process 5 of 16 on node9.oscardomain

--> MPI C++ bindings test:

Hello World! I am 0 of 16
Hello World! I am 1 of 16
Hello World! I am 2 of 16
Hello World! I am 15 of 16
Hello World! I am 3 of 16
Hello World! I am 4 of 16
Hello World! I am 5 of 16
Hello World! I am 6 of 16
Hello World! I am 7 of 16
Hello World! I am 9 of 16
Hello World! I am 8 of 16
Hello World! I am 10 of 16
Hello World! I am 13 of 16
Hello World! I am 12 of 16
Hello World! I am 14 of 16
Hello World! I am 11 of 16

--> MPI Fortran bindings test:

 Hello World! I am  12 of  16
 Hello World! I am  14 of  16
 Hello World! I am  8 of  16
 Hello World! I am  10 of  16
 Hello World! I am  2 of  16
 Hello World! I am  6 of  16
 Hello World! I am  4 of  16
 Hello World! I am  13 of  16
 Hello World! I am  11 of  16
 Hello World! I am  3 of  16
 Hello World! I am  7 of  16
 Hello World! I am  5 of  16
 Hello World! I am  15 of  16
 Hello World! I am  9 of  16
 Hello World! I am  1 of  16
 Hello World! I am  0 of  16
MPICH test complete
Unless there are errors above, test completed successfully.
====

And the PBS/torque test:
====
cat shelltest.out
node15.oscardomain
node14.oscardomain
node13.oscardomain
node11.oscardomain
node10.oscardomain
node9.oscardomain
node8.oscardomain
node7.oscardomain
node6.oscardomain
node5.oscardomain
node4.oscardomain
node3.oscardomain
node2.oscardomain
node1.oscardomain
node16.oscardomain
node12.oscardomain
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:42
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
Hello, date is 01/26/06, time is 10:21:43
====

Looking at the src RPM for mpich, its configure command only specifies a couple of extra options beyond what I used, so I am not sure whether that is what is causing the issue.
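
One observation from job 24 above: the p4_error came from the compute nodes being handed the head node as "rhel4.ehpctc.intern", a name they apparently could not resolve, while the working run used the internal "oscardomain" names. If someone hits the same symptom, a quick resolution check along these lines may help (hostnames are the ones from this thread; adjust for your site):

====
# on a compute node: can it resolve the name MPICH put in the procgroup?
$ getent hosts rhel4.ehpctc.intern

# what does the head node call itself?
$ hostname
$ hostname -f

# run the same lookup remotely from the head node
$ ssh node1.oscardomain getent hosts rhel4.ehpctc.intern
====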

I need to freeze testing here for a while since it is now working. This small cluster is used in a grad-level course here to teach parallel concepts. I should be able to do more testing, if needed, at the end of the semester.

Hopefully this will help someone else if they stumble onto the same problem.

Thanks to all for all of the suggestions.

--
Jim Summers
School of Computer Science-University of Oklahoma
-------------------------------------------------

