We're launching a seed daemon so that we can get registry persistence
across multiple job launches. However, there is a race condition
between launching the daemon and the first call to orte_init() that
can result in a bus error. We set the OMPI_MCA_universe and
OMPI_MCA_orte_univ_exist environment variables prior to calling
orte_init() so that orte knows how to connect to the daemon, but if
the daemon hasn't started yet, orte_init() fails with a bus error in
orte_rds_base_close(). Stack trace below:
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c
Thread 0 Crashed:
0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66
1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121
2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648
3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898
4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25
5 libopen-rte.0.dylib 0x000a0190 orte_init + 81
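For context, the environment setup described above amounts to something like the following C sketch. The helper name export_universe_env, the caller-supplied universe name, and the value "1" for OMPI_MCA_orte_univ_exist are assumptions for illustration only; the two variable names are the only parts taken from our actual setup.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: export the two variables that tell orte how to
 * find an existing universe. universe_name is supplied by the caller.
 * Returns 0 on success, -1 if setenv() fails. */
int export_universe_env(const char *universe_name)
{
    if (setenv("OMPI_MCA_universe", universe_name, 1) != 0) {
        return -1;
    }
    /* Assumption: "1" marks the universe as already existing. */
    if (setenv("OMPI_MCA_orte_univ_exist", "1", 1) != 0) {
        return -1;
    }
    return 0;
}
```

The helper would be called once, before orte_init(), in the process that connects to the daemon.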
A related question: is there any way to check for the daemon other
than calling orte_init()? At the moment we just sleep for a few
seconds after launching the daemon, but this is obviously not a very
satisfactory solution. I can't see anywhere in the source where such
a check is done.
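To make the goal concrete, here is a minimal sketch of what we would like to replace the fixed sleep with: a bounded polling loop that retries a readiness check with a short delay. The probe callback is purely hypothetical, since I don't know of an ORTE call that performs such a check; countdown_probe below is only a stand-in so the sketch is self-contained, and wait_until_ready is an illustrative name.

```c
#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness check: returns nonzero once the daemon is up. */
typedef int (*probe_fn)(void *ctx);

/* Call probe up to max_attempts times, sleeping delay_ms between tries.
 * Returns the attempt number that succeeded, or 0 if none did. */
int wait_until_ready(probe_fn probe, void *ctx, int max_attempts, long delay_ms)
{
    struct timespec delay;
    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (probe(ctx)) {
            return attempt;
        }
        if (attempt < max_attempts) {
            nanosleep(&delay, NULL);  /* back off before the next try */
        }
    }
    return 0;
}

/* Stand-in probe for demonstration: reports "not ready" until the
 * counter that ctx points to has been decremented to zero. */
int countdown_probe(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

With a real probe in place of countdown_probe, the launcher would call orte_init() only after wait_until_ready() returns nonzero, and report a startup failure otherwise.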
Thanks,
Greg