Re: [OMPI users] OpenMpi 1.1 and Torque 2.1.1

2006-06-30 Thread Justin Bronder

Greetings,

The bug with poll() was fixed in the stable Torque 2.1.1 release, and I have
checked again to make sure that pbsdsh does work.

jbronder@meldrew-linux ~/src/hpl $ qsub -I -q default -l nodes=4:ppn=2 -l opsys=darwin
qsub: waiting for job 312.ldap1.meldrew.clusters.umaine.edu to start
qsub: job 312.ldap1.meldrew.clusters.umaine.edu ready

node96:~ jbronder$ pbsdsh uname -a
Darwin node96.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node96.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node94.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node94.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node95.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node95.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node93.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
Darwin node93.meldrew.clusters.umaine.edu 8.6.0 Darwin Kernel Version 8.6.0:
Tue Mar  7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power
Macintosh powerpc
node96:~ jbronder$
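
For what it's worth, ompi_info should also show whether the TM components were
built into this install (a quick sanity check; I'm assuming they are listed as
"pls: tm" and "ras: tm" entries, as in the 1.1 source tree):

# Expect lines such as "MCA pls: tm ..." and "MCA ras: tm ..." when Open MPI
# was configured with --with-tm.
/usr/local/ompi-xl/bin/ompi_info | grep ": tm"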

If there is anything else I should check, please let me know.

Thanks,

Justin Bronder.

On 6/30/06, Jeff Squyres (jsquyres)  wrote:


 There was a bug in early Torque 2.1.x versions (I'm afraid I don't
remember which one) that -- I think -- had something to do with a faulty
poll() implementation.  Whatever the problem was, it caused all TM launchers
to fail on OS X.

Can you see if the Torque-included tool pbsdsh works properly?  It uses
the same API that Open MPI does (the "tm" api).

If pbsdsh fails, I suspect you're looking at a Torque bug.  I know
that Garrick S. has since fixed the problem in the Torque code base; I don't
know if they've had a release since then that included the fix.

If pbsdsh works, let us know and we'll track this down further.


Re: [OMPI users] OpenMpi 1.1 and Torque 2.1.1

2006-06-30 Thread Jeff Squyres (jsquyres)
There was a bug in early Torque 2.1.x versions (I'm afraid I don't
remember which one) that -- I think -- had something to do with a faulty
poll() implementation.  Whatever the problem was, it caused all TM
launchers to fail on OS X.
 
Can you see if the Torque-included tool pbsdsh works properly?  It uses
the same API that Open MPI does (the "tm" api).  
 
If pbsdsh fails, I suspect you're looking at a Torque bug.  I know that
Garrick S. has since fixed the problem in the Torque code base; I don't
know if they've had a release since then that included the fix.
 
If pbsdsh works, let us know and we'll track this down further.
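
(To be concrete, the check I have in mind is just something like the following
from inside an interactive job; a rough sketch, and the queue and node counts
will obviously differ on your cluster:)

qsub -I -l nodes=2:ppn=2
# pbsdsh runs the command once per allocated processor slot, so this should
# print one line per slot if the TM interface is working.
pbsdsh uname -a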





[OMPI users] OpenMpi 1.1 and Torque 2.1.1

2006-06-29 Thread Justin Bronder

I'm having trouble getting Open MPI to execute jobs when submitting through
Torque. Everything works fine if I use the included mpirun scripts, but this
is obviously not a good solution for the general users on the cluster.

I'm running under OS X 10.4, Darwin 8.6.0.  I configured Open MPI with:
export CC=/opt/ibmcmp/vac/6.0/bin/xlc
export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
export LDFLAGS=-lSystemStubs
export LIBTOOL=glibtool

PREFIX=/usr/local/ompi-xl

./configure \
   --prefix=$PREFIX \
   --with-tm=/usr/local/pbs/ \
   --with-gm=/opt/gm \
   --enable-static \
   --disable-cxx

I also had to employ the fix listed in:
http://www.open-mpi.org/community/lists/users/2006/04/1007.php
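
(One way to double-check that configure actually detected the TM support is to
capture its output and grep for the tm checks; a rough sketch, and the exact
wording of configure's checks is a guess on my part:)

./configure --prefix=$PREFIX --with-tm=/usr/local/pbs/ --with-gm=/opt/gm \
    --enable-static --disable-cxx 2>&1 | tee configure.log
# The tm-related probes (header and library checks) should not come back "no".
grep -i "tm" configure.log | grep -i check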


I've attached the output of ompi_info while in an interactive job.  Looking
through the list, I can at least save a bit of trouble by listing what does
work: anything outside of Torque seems fine.  From within an interactive job,
pbsdsh works fine, so the earlier problems with poll() are fixed.

Here is the error that is reported when I attempt to run hostname on one
processor:
node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1
-mca pls_tm_debug 1 /bin/hostname
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: orted --no-daemonize
--bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename  --universe
jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "
0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set
prefix:/usr/local/ompi-xl
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node
localhost
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH:
/usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: found
/usr/local/ompi-xl/bin/orted
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed --
setting mpi_yield_when_idle to 0
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename localhost --universe
jbron...@node96.meldrew.clusters.umaine.edu:default-universe
--nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0
;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned
error -13
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found
in file rmgr_urm.c at line 184
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found
in file rmgr_urm.c at line 432
[node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with
errno=-13
node96:/usr/src/openmpi-1.1 jbronder$
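
(Two follow-up checks that might help narrow this down; a rough sketch:
--debug-daemons is mpirun's generic daemon-debugging option as far as I know,
and otool -L just shows which libraries orted links against and whether they
resolve on the compute node:)

# More verbose output about how the daemons are started:
/usr/local/ompi-xl/bin/mpirun -np 1 -mca pls_tm_debug 1 --debug-daemons /bin/hostname
# Check that orted's shared libraries all resolve on this node:
otool -L /usr/local/ompi-xl/bin/orted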


My thanks for any help in advance,

Justin Bronder.


ompi_info.log.gz
Description: GNU Zip compressed data