Hi Ralph,

> Just trying to understand - why are you saying this is a pmix problem?
> Obviously, something to do with mpirun is failing, but I don't see any
> indication here that it has to do with pmix.

No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs across multiple nodes using slurm (i.e. -mca plm slurm), I'd get this:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

But the same job would work if I submitted it with rsh (i.e. -mca plm rsh). I read online that there were issues with CPU binding, so I thought 4.1.0 might have resolved it.

So, back to the problem at hand. I reconfigured with --enable-debug and this is what I get:

andrej@terra:~/system/openmpi-4.1.0$ mpirun
[terra:4145441] *** Process received signal ***
[terra:4145441] Signal: Segmentation fault (11)
[terra:4145441] Signal code:  (128)
[terra:4145441] Failing at address: (nil)
[terra:4145441] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
[terra:4145441] [ 1] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
[terra:4145441] [ 2] /usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
[terra:4145441] [ 3] /usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
[terra:4145441] [ 4] /usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
[terra:4145441] [ 5] /usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
[terra:4145441] [ 6] /usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]
[terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
[terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
[terra:4145441] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]
[terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
[terra:4145441] *** End of error message ***
Segmentation fault (core dumped)

gdb backtrace:

(gdb) r
Starting program: /usr/local/bin/mpirun
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
(gdb) bt
#0  0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#1  0x00007ffff33042e6 in pmix3x_server_init () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#2  0x00007ffff7ef15ec in pmix_server_init () at orted/pmix/pmix_server.c:296
#3  0x00007ffff78cc8d5 in rte_init () at ess_hnp_module.c:329
#4  0x00007ffff7f6d836 in orte_init (pargc=0x7fffffffddbc, pargv=0x7fffffffddb0, flags=4) at runtime/orte_init.c:271
#5  0x00007ffff7f6f0cd in orte_submit_init (argc=1, argv=0x7fffffffe478, opts=0x0) at orted/orted_submit.c:570
#6  0x00005555555556bc in orterun (argc=1, argv=0x7fffffffe478) at orterun.c:136
#7  0x000055555555534d in main (argc=1, argv=0x7fffffffe478) at main.c:13

This build is using the latest openpmix from github master.

Thanks,
Andrej
