Hi Ralph,

> Just trying to understand - why are you saying this is a pmix problem?
> Obviously, something to do with mpirun is failing, but I don't see any
> indication here that it has to do with pmix.

No, it was 4.0.3 that had the pmix problem. Whenever I tried to submit
jobs across multiple nodes using Slurm (i.e. -mca plm slurm), I'd get this:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
The same job worked when I launched it with rsh (i.e. -mca plm rsh).
I read online that there were issues with CPU binding, so I thought 4.1.0
might have resolved it.
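For reference, the two launches above differ only in the plm value. A minimal sketch of the command lines follows; the process count and program (-np 4 ./a.out) are placeholders rather than the actual job, and launch_cmd is a hypothetical helper that only prints the command instead of running it:

```shell
#!/bin/sh
# launch_cmd is a hypothetical dry-run helper: it prints the mpirun
# command line for a given launcher (plm) and does not execute anything.
launch_cmd() {
    plm="$1"
    shift
    printf '%s\n' "mpirun -mca plm $plm $*"
}

# Slurm launcher: the case that produced the ORTE daemon error on 4.0.3.
launch_cmd slurm -np 4 ./a.out

# rsh launcher: the case that worked.
launch_cmd rsh -np 4 ./a.out
```

Dropping the helper and running the printed lines directly on the cluster should reproduce the comparison.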
So, back to the problem at hand. I reconfigured with --enable-debug and
this is what I get:
andrej@terra:~/system/openmpi-4.1.0$ mpirun
[terra:4145441] *** Process received signal ***
[terra:4145441] Signal: Segmentation fault (11)
[terra:4145441] Signal code: (128)
[terra:4145441] Failing at address: (nil)
[terra:4145441] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
[terra:4145441] [ 1] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
[terra:4145441] [ 2] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
[terra:4145441] [ 3] /usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
[terra:4145441] [ 4] /usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
[terra:4145441] [ 5] /usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
[terra:4145441] [ 6] /usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]
[terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
[terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
[terra:4145441] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]
[terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
[terra:4145441] *** End of error message ***
Segmentation fault (core dumped)
gdb backtrace:
(gdb) r
Starting program: /usr/local/bin/mpirun
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
(gdb) bt
#0 0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#1 0x00007ffff33042e6 in pmix3x_server_init () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#2 0x00007ffff7ef15ec in pmix_server_init () at orted/pmix/pmix_server.c:296
#3 0x00007ffff78cc8d5 in rte_init () at ess_hnp_module.c:329
#4 0x00007ffff7f6d836 in orte_init (pargc=0x7fffffffddbc, pargv=0x7fffffffddb0, flags=4) at runtime/orte_init.c:271
#5 0x00007ffff7f6f0cd in orte_submit_init (argc=1, argv=0x7fffffffe478, opts=0x0) at orted/orted_submit.c:570
#6 0x00005555555556bc in orterun (argc=1, argv=0x7fffffffe478) at orterun.c:136
#7 0x000055555555534d in main (argc=1, argv=0x7fffffffe478) at main.c:13
This build uses the latest OpenPMIx from GitHub master.
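In case it helps anyone reproduce this, the gdb session above can probably be captured non-interactively as well. This is a rough sketch, assuming gdb's standard -batch, -ex and --args options; bt_cmd is a hypothetical helper that only prints the command, and the mpirun path is the one from my install:

```shell
#!/bin/sh
# Path taken from the gdb session above; override MPIRUN for other installs.
MPIRUN=${MPIRUN:-/usr/local/bin/mpirun}

# bt_cmd is a hypothetical dry-run helper: it prints a gdb command that runs
# the program and then prints a backtrace once it stops on the SIGSEGV.
bt_cmd() {
    printf 'gdb -batch -ex run -ex bt --args %s\n' "$1"
}

bt_cmd "$MPIRUN"
```

Running the printed command with its output redirected to a file should give a trace that can be attached to a reply.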
Thanks,
Andrej