sad...@gmx.net wrote:
Pak Lui wrote:
sad...@gmx.net wrote:
Sorry for the late reply, but I didn't have access to the machine over the weekend.
I don't really know what this means. People have explained "loose"
vs. "tight" integration to me before, but since I'm not an SGE user,
the definitions never stick.
I *assume* loosely coupled jobs are just jobs for which SGE finds some
nodes to run them on and, from then on, doesn't care about anything
related to the jobs. In contrast, for tightly coupled jobs SGE also
keeps track of any sub-processes the job may spawn, terminates them
too in case of a failure, and takes care of the specified resources.
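For illustration, a loosely integrated job script might look roughly like
this (a sketch only; the PE name "mpi" and ./my_app are placeholders, while
$PE_HOSTFILE, $NSLOTS and $TMPDIR are set by SGE). The script itself turns
SGE's allocation into a hostfile, and mpirun then launches the ranks over
rsh/ssh, so sge_execd never sees or controls the remote processes:

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
# build an Open MPI hostfile from SGE's allocation ("host slots ..." per line)
awk '{print $1, "slots=" $2}' $PE_HOSTFILE > $TMPDIR/machines
# loose integration: mpirun starts the remote ranks itself via rsh/ssh
mpirun -np $NSLOTS -hostfile $TMPDIR/machines ./my_app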
Based on your prior e-mail, it looks like you are always invoking
"ulimit" via "pdsh", even under SGE jobs. This is incorrect.
why?
Can't you just submit an SGE job script that runs "ulimit"?
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
hostname && ulimit -a
At the moment I'm quite confused: I want to use the C shell, but ulimit is
a bash built-in; the C shell uses "limit" instead... hmm... and SGE
obviously uses bash despite my request for csh in the first line (see the
csh sketch after the output below). But even if I just use #!/bin/bash
I get the same limits:
-sh-3.00$ cat MPI_Job.o112116
node02
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 139264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Oops => max locked memory is still 32 kbytes... So this isn't OMPI's fault.
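As an aside on the csh confusion above: with SGE's default shell_start_mode
(posix_compliant), the #! line of the script is ignored and the queue's
configured shell is used instead, which is presumably why the script ran
under bash. Requesting csh explicitly and using its built-in "limit" would
look roughly like this (a hedged sketch, not tested here):

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
#$ -S /bin/csh
hostname
# csh has no ulimit built-in; its equivalent is "limit"
limit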
This looks like sge_execd isn't able to pick up the correct system
defaults from the limits.conf file after you applied the change. Maybe
you will need to restart the daemon?
Yes, I posted the same question to the Sun Grid Engine mailing list, and
as Jeff initially suspected, it was the improper limits for the daemons
(sge_execd). So I had to edit each node's init script
(/etc/init.d/sgeexecd) and put "ulimit -l unlimited" before the line that
starts sge_execd, then kill all the sge_execd daemons (running jobs won't
be affected if you use "qconf -ke all") and restart sge_execd on every
node, as sketched below. After that everything with SGE and OMPI 1.1.1
was fine.
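In short, what I did on each node was roughly this (the exact restart
command may differ with your installation):

# in /etc/init.d/sgeexecd, add this before the line that starts sge_execd:
ulimit -l unlimited

# then shut down the execution daemons (running jobs are not affected):
qconf -ke all

# and restart the daemon on every node:
/etc/init.d/sgeexecd start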
For the full story, just read the short thread at:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=20390
At this point, big thanks to Jeff and everyone else who helped me!
Are there any suggestions for the compilation error?
Are you referring to this SEGV error here? I am assuming this is OMPI
1.1.1, so you are using the rsh PLS to launch your executables (i.e.,
loose integration).
>-sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] [0,0,0] setting up session dir with
> [headnode:23178] universe default-universe-23178
> [headnode:23178] user me
> [headnode:23178] host headnode
> [headnode:23178] jobid 0
> [headnode:23178] procid 0
> [headnode:23178] procdir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0/0
> [headnode:23178] jobdir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0
> [headnode:23178] unidir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178
> [headnode:23178] top: openmpi-sessions-me@headnode_0
> [headnode:23178] tmp: /tmp
> [headnode:23178] [0,0,0] contact_file
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/universe-
> setup.txt
> [headnode:23178] [0,0,0] wrote setup file
> [headnode:23178] *** Process received signal ***
> [headnode:23178] Signal: Segmentation fault (11)
> [headnode:23178] Signal code: Address not mapped (1)
> [headnode:23178] Failing at address: 0x1
> [headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
> [headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
> [headnode:23178] [ 2]
> /home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f)
> [0x2a9723cc7f]
> [headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so
> [0x2a9764fa90]
> [headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b)
> [0x402ca3]
> [headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
> [headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x39ecf1c3fb]
> [headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
> [headnode:23178] *** End of error message ***
> Segmentation fault
So is it true that the SEGV only occurred under the SGE environment and
not in a normal environment? If so, then I am baffled, because starting
the rsh PLS under the SGE environment in 1.1.1 should be no different
from starting the rsh PLS without SGE.
There seems to be only one strcmp() that can fail in
orte_pls_rsh_launch(). I can only assume there is some memory corruption
when fetching either ras_node->node_name or orte_system_info.nodename for
the strcmp() call.
https://svn.open-mpi.org/trac/ompi/browser/tags/v1.1-series/v1.1.1/orte/mca/pls/rsh/pls_rsh_module.c
Maybe a way to work around it is to use a more recent OMPI version. A
lot of things in ORTE have been revamped since 1.1, so I would encourage
you to try a more recent OMPI, since there may be some fixes that didn't
get brought over to 1.1. Plus, with 1.2 you should be able to use tight
integration via the gridengine module there; a sketch follows below.
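For reference, a tightly integrated setup with 1.2 would look roughly like
this (just a sketch; the PE name "orte" and ./my_app are placeholders).
The key is a parallel environment with control_slaves enabled:

# parallel environment (e.g. "qconf -ap orte"); control_slaves is the
# important part, so sge_execd can track the remote ranks
pe_name            orte
slots              999
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
start_proc_args    /bin/true
stop_proc_args     /bin/true

and a job script along these lines:

#!/bin/sh
#$ -N MPI_Job
#$ -pe orte 4
# with the gridengine module, mpirun reads the allocation from SGE
# directly, so no hostfile or -H list is needed
mpirun -np $NSLOTS ./my_app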
Many, many thousand thanks for the great help here in the forum!
--
- Pak Lui
pak....@sun.com