Jeff Squyres wrote:
>> Hmm, I've heard about conflicts with OMPI 1.2.x and OFED 1.1 (sorry, no
>> reference here),
>
> I'm unaware of any problems with OMPI 1.2.x and OFED 1.1. I run OFED
> 1.1 on my cluster at Cisco and have many different versions of OMPI
> installed (1.2, trunk, etc.).
Yes, you are right, I misread it (in the OMPI 1.2 changelog (README) it is
OFED 1.0 that isn't considered to work with OMPI 1.2. Sorry..).

>> and I've got no luck producing a working OMPI
>> installation ("mpirun --help" runs, and ./IMB-MPI1 compiles and runs too,
>> but "mpirun -np 2 node03,node14 IMB-MPI1" doesn't (segmentation fault))...
>
> Can you send more information on this? See
> http://www.open-mpi.org/community/help/

-sh-3.00$ ompi/bin/mpirun -d -np 2 -H node03,node06 hostname
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] connect_uni: connection not allowed
[headnode:23178] [0,0,0] setting up session dir with
[headnode:23178] universe default-universe-23178
[headnode:23178] user me
[headnode:23178] host headnode
[headnode:23178] jobid 0
[headnode:23178] procid 0
[headnode:23178] procdir: /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0/0
[headnode:23178] jobdir: /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/0
[headnode:23178] unidir: /tmp/openmpi-sessions-me@headnode_0/default-universe-23178
[headnode:23178] top: openmpi-sessions-me@headnode_0
[headnode:23178] tmp: /tmp
[headnode:23178] [0,0,0] contact_file /tmp/openmpi-sessions-me@headnode_0/default-universe-23178/universe-setup.txt
[headnode:23178] [0,0,0] wrote setup file
[headnode:23178] *** Process received signal ***
[headnode:23178] Signal: Segmentation fault (11)
[headnode:23178] Signal code: Address not mapped (1)
[headnode:23178] Failing at address: 0x1
[headnode:23178] [ 0] /lib64/tls/libpthread.so.0 [0x39ed80c430]
[headnode:23178] [ 1] /lib64/tls/libc.so.6(strcmp+0) [0x39ecf6ff00]
[headnode:23178] [ 2] /home/me/ompi/lib/openmpi/mca_pls_rsh.so(orte_pls_rsh_launch+0x24f) [0x2a9723cc7f]
[headnode:23178] [ 3] /home/me/ompi/lib/openmpi/mca_rmgr_urm.so [0x2a9764fa90]
[headnode:23178] [ 4] /home/me/ompi/bin/mpirun(orterun+0x35b) [0x402ca3]
[headnode:23178] [ 5] /home/me/ompi/bin/mpirun(main+0x1b) [0x402943]
[headnode:23178] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x39ecf1c3fb]
[headnode:23178] [ 7] /home/me/ompi/bin/mpirun [0x40289a]
[headnode:23178] *** End of error message ***
Segmentation fault

>> yes, I already read the FAQ, and even setting them to unlimited has
>> not worked. In SGE one can specify the limits for SGE jobs via
>> e.g. the qmon tool (configuring queues > select queue > modify >
>> limits), but there everything is set to infinity. (Besides that,
>> the job is running with a static machinefile (is this a
>> "noninteractive" job?)) How can I test the ulimits of interactive
>> and noninteractive jobs?
>
> Launch an SGE job that calls the shell command "limit" (if you run C-
> shell variants) or "ulimit -l" (if you run Bourne shell variants).
> Ensure that the output is "unlimited".

I've done that already, but how do I distinguish between tight-coupled job
ulimits and loose-coupled job ulimits? I tested passing $TMPDIR/machines to a
shell script which in turn runs "ulimit -a", *assuming* this counts as a
tight-coupled job, but each node returned unlimited.. and the same without
$TMPDIR/machines. Even the headnode is set to unlimited.
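For reference, here is roughly the kind of test I ran (just a sketch: the PE
name "mpi", the helper name check_limit.sh, and the assumption of a shared
$HOME are placeholders; qrsh -inherit is the path SGE uses for tightly
integrated remote starts, plain ssh stands in for a loose coupling):

#!/bin/sh
# check_limit.sh - print the locked-memory limit on the node it runs on
echo "`hostname`: max locked memory = `ulimit -l`"

#!/bin/sh
#$ -pe mpi 2
#$ -cwd
# Run the helper on every host from the machinefile, once through SGE's
# tight-integration path (qrsh -inherit) and once through plain ssh.
# Assumes hostnames in column 1 of $TMPDIR/machines and a shared $HOME.
for host in `awk '{print $1}' $TMPDIR/machines | sort -u`; do
    echo "== $host (tight, qrsh -inherit) =="
    qrsh -inherit $host $HOME/check_limit.sh
    echo "== $host (loose, ssh) =="
    ssh $host $HOME/check_limit.sh
done

If the tight path reported a lower limit than the ssh path, the restriction
would be coming from the SGE side (execd/shepherd) rather than from the
interactive login environment; here both report "unlimited".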
> What are the limits of the user that launches the SGE daemons? I.e.,
> did the SGE daemons get started with proper "unlimited" limits? If
> not, that could hamper SGE's ability to set the limits that you told
> it to via qmon (remember my disclaimer: I know nothing about SGE, so
> this is speculation).

The limits in /etc/security/limits.conf apply to all users (using a '*'),
hence the SGE processes and daemons shouldn't have any limits.

But thanks anyway => I will post this issue to an SGE mailing list soon.
The config.log and the `ompi_info --all` output are attached. Thanks again
to all of you.
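P.S.: For completeness, the relevant entries in our /etc/security/limits.conf
look roughly like this (memlock is the limit that "ulimit -l" reports; the
exact file and syntax may differ per distribution):

# domain  type  item     value
*         soft  memlock  unlimited
*         hard  memlock  unlimited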
logs.tbz
Description: application/bzip-compressed-tar