On Jun 22, 2007, at 10:44 AM, sad...@gmx.net wrote:
Can you send more information on this? See http://www.open-mpi.org/community/help/
-sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] [0,0,0] setting up session dir with
[headnode:23178] universe default-universe-23178
[headnode:23178] user me
[headnode:23178] host headnode
[headnode:23178] jobid 0
[headnode:23178] procid 0
[headnode:23178] procdir:
/tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0/0
[headnode:23178] jobdir:
/tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0
[headnode:23178] unidir:
/tmp/openmpi-sessions-me@headnode_0/default-universe-23178
[headnode:23178] top: openmpi-sessions-me@headnode_0
[headnode:23178] tmp: /tmp
[headnode:23178] [0,0,0] contact_file
/tmp/openmpi-sessions-me@headnode_0/default-universe-23178/universe-
setup.txt
[headnode:23178] [0,0,0] wrote setup file
[headnode:23178] *** Process received signal ***
[headnode:23178] Signal: Segmentation fault (11)
[headnode:23178] Signal code: Address not mapped (1)
[headnode:23178] Failing at address: 0x1
[headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
[headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
[headnode:23178] [ 2]
/home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f)
[0x2a9723cc7f]
[headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so
[0x2a9764fa90]
[headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b)
[0x402ca3]
[headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
[headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x39ecf1c3fb]
[headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
[headnode:23178] *** End of error message ***
Segmentation fault
This should not happen -- this is [obviously] even before any MPI
processing starts. Are you inside an SGE job here?
Pak/Ralph: any ideas?
Launch an SGE job that calls the shell command "limit" (if you run C-
shell variants) or "ulimit -l" (if you run Bourne shell variants).
Ensure that the output is "unlimited".
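For instance, inside the job, one of these (just a sketch of the two
shell families mentioned above):

    # Bourne-shell variants (sh/bash/ksh): print the max locked-memory size
    ulimit -l

    # C-shell variants (csh/tcsh): list all limits, including locked memory
    limit

Both should report "unlimited" for the locked-memory entry.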
I've done that already, but how do I distinguish between tightly coupled
job ulimits and loosely coupled job ulimits? I tried passing
$TMPDIR/machines to a shell script that in turn runs "ulimit -a",
*assuming* this counts as a tightly coupled job, but every node
returned unlimited, and the same happened without $TMPDIR/machines. Even
the headnode is set to unlimited.
I don't really know what this means. People have explained "loose"
vs. "tight" integration to me before, but since I'm not an SGE user,
the definitions never stick with me.
Based on your prior e-mail, it looks like you are always invoking
"ulimit" via "pdsh", even under SGE jobs. This is incorrect. Can't
you just submit an SGE job script that runs "ulimit"?
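For example, something like this (a rough sketch; the script name and
SGE directives are just illustrative):

    #!/bin/sh
    #$ -N ulimit-check
    #$ -cwd
    # This runs under the SGE execution daemon, so the limit printed here
    # is the one the daemon actually passes on -- unlike a pdsh/ssh run.
    ulimit -l

and then submit it with a plain "qsub ulimit-check.sh" and check the
job's output file.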
What are the limits of the user that launches the SGE daemons? I.e.,
did the SGE daemons get started with proper "unlimited" limits? If
not, that could hamper SGE's ability to set the limits that you told
it to use.
The limits in /etc/security/limits.conf apply to all users (using a
'*'), hence the SGE processes and daemons shouldn't have any limits.
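(For reference, the kind of entries being described look like this in
standard limits.conf syntax; the exact lines in your file may differ:)

    # /etc/security/limits.conf -- only consulted for PAM-mediated logins
    *    soft    memlock    unlimited
    *    hard    memlock    unlimited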
Not really. limits.conf is not universally applied; it's a PAM
entity. So for daemons that start via /etc/init.d scripts (or
whatever the equivalent is on your system), PAM limits are not
necessarily applied. For example, I had to manually insert a "ulimit
-Hl unlimited" in the startup script for my SLURM daemons.
--
Jeff Squyres
Cisco Systems