sad...@gmx.net wrote:
Pak Lui wrote:
sad...@gmx.net wrote:
Sorry for the late reply, but I didn't have access to the machine over the weekend.
I don't really know what this means. People have explained "loose"
vs. "tight" integration to me before, but since I'm not an SGE user,
the definitions never stick.
I *assume* loosely coupled jobs are just jobs for which SGE finds some
nodes to run them on and, from then on, doesn't care about anything
related to the jobs. In contrast, for tightly coupled jobs SGE also
keeps track of any sub-processes the job may spawn, terminates them
too in case of a failure, and takes care of the specified resources.
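For illustration, a loosely integrated job script might look roughly like
this (a sketch only; the PE name "mpi" and ./my_app are placeholders, while
$PE_HOSTFILE, $NSLOTS and $TMPDIR are set by SGE). The script itself turns
SGE's allocation into a hostfile, and mpirun then launches the ranks over
rsh/ssh, so sge_execd never sees or controls the remote processes:

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
# build an Open MPI hostfile from SGE's allocation ("host slots ..." per line)
awk '{print $1, "slots=" $2}' $PE_HOSTFILE > $TMPDIR/machines
# loose integration: mpirun starts the remote ranks itself via rsh/ssh
mpirun -np $NSLOTS -hostfile $TMPDIR/machines ./my_app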
Based on your prior e-mail, it looks like you are always invoking
"ulimit" via "pdsh", even under SGE jobs. This is incorrect.
why?
Can't you just submit an SGE job script that runs "ulimit"?
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
hostname && ulimit -a
At the moment I'm quite confused: I want to use the C shell, but ulimit is
a bash built-in; the C shell uses "limit" instead... hmm... and SGE
obviously uses bash despite my request for csh in the first line (see the
csh sketch after the output below). But even if I just use #!/bin/bash
I get the same limits:
-sh-3.00$ cat MPI_Job.o112116
node02
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 139264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Oops => max locked memory is still 32 kbytes... So this isn't OMPI's fault.
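As an aside on the csh confusion above: with SGE's default shell_start_mode
(posix_compliant), the #! line of the script is ignored and the queue's
configured shell is used instead, which is presumably why the script ran
under bash. Requesting csh explicitly and using its built-in "limit" would
look roughly like this (a hedged sketch, not tested here):

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
#$ -S /bin/csh
hostname
# csh has no ulimit built-in; its equivalent is "limit"
limit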
This looks like sge_execd isn't able to pick up the correct system
defaults from the limits.conf file after you applied the change. Maybe
you will need to restart the daemon?
Yes, I posted the same question to the Sun Grid Engine mailing list, and
as Jeff initially suspected, it was the improper limits for the daemons
(sge_execd). So I had to edit each node's init script
(/etc/init.d/sgeexecd) and put "ulimit -l unlimited" before the line that
starts sge_execd, then kill all the sge_execd daemons (running jobs won't
be affected if you use "qconf -ke all") and restart sge_execd on every
node, as sketched below. After that everything with SGE and OMPI 1.1.1
was fine.
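In short, what I did on each node was roughly this (the exact restart
command may differ with your installation):

# in /etc/init.d/sgeexecd, add this before the line that starts sge_execd:
ulimit -l unlimited

# then shut down the execution daemons (running jobs are not affected):
qconf -ke all

# and restart the daemon on every node:
/etc/init.d/sgeexecd start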
For the full story, just read the short thread at:
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=20390
At this point, big thanks to Jeff and everyone else who helped me!
Are there any suggestions for the compilation error?
Are you referring to this SEGV error here? I am assuming this is OMPI
1.1.1, so you are using the rsh PLS to launch your executables (i.e.,
loose integration).
>-sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] connect_uni: connection not allowed
> [headnode:23178] [0,0,0] setting up session dir with
> [headnode:23178] universe default-universe-23178
> [headnode:23178] user me
> [headnode:23178] host headnode
> [headnode:23178] jobid 0
> [headnode:23178] procid 0
> [headnode:23178] procdir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0/0
> [headnode:23178] jobdir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0
> [headnode:23178] unidir:
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178
> [headnode:23178] top: openmpi-sessions-me@headnode_0
> [headnode:23178] tmp: /tmp
> [headnode:23178] [0,0,0] contact_file
> /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/universe-
> setup.txt
> [headnode:23178] [0,0,0] wrote setup file
> [headnode:23178] *** Process received signal ***
> [headnode:23178] Signal: Segmentation fault (11)
> [headnode:23178] Signal code: Address not mapped (1)
> [headnode:23178] Failing at address: 0x1
> [headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
> [headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
> [headnode:23178] [ 2]
> /home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f)
> [0x2a9723cc7f]
> [headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so
> [0x2a9764fa90]
> [headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b)
> [0x402ca3]
> [headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
> [headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x39ecf1c3fb]
> [headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
> [headnode:23178] *** End of error message ***
> Segmentation fault
So is it true that the SEGV only occurred under the SGE environment and
not in a normal environment? If so, then I am baffled, because starting
the rsh PLS under the SGE environment in 1.1.1 should be no different
from starting the rsh PLS without SGE.
There seems to be only one strcmp() that can fail in
orte_pls_rsh_launch(). I can only assume there is some memory corruption
when fetching either ras_node->node_name or orte_system_info.nodename for
the strcmp() call.
https://svn.open-mpi.org/trac/ompi/browser/tags/v1.1-series/v1.1.1/orte/mca/pls/rsh/pls_rsh_module.c
Maybe a way to work around it is to use a more recent OMPI version. A
lot of things in ORTE have been revamped since 1.1, so I would encourage
you to try a more recent OMPI, since there may be some fixes that didn't
get brought over to 1.1. Plus, with 1.2 you should be able to use tight
integration via the gridengine module there; a sketch follows below.
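For reference, a tightly integrated setup with 1.2 would look roughly like
this (just a sketch; the PE name "orte" and ./my_app are placeholders).
The key is a parallel environment with control_slaves enabled:

# parallel environment (e.g. "qconf -ap orte"); control_slaves is the
# important part, so sge_execd can track the remote ranks
pe_name            orte
slots              999
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
start_proc_args    /bin/true
stop_proc_args     /bin/true

and a job script along these lines:

#!/bin/sh
#$ -N MPI_Job
#$ -pe orte 4
# with the gridengine module, mpirun reads the allocation from SGE
# directly, so no hostfile or -H list is needed
mpirun -np $NSLOTS ./my_app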
Many, many thousand thanks for the great help here in the forum!
--
- Pak Lui
pak....@sun.com