We're launching a seed daemon to get registry persistence across multiple job launches. However, there is a race condition between launching the daemon and the first call to orte_init() that can result in a bus error. We set the OMPI_MCA_universe and OMPI_MCA_orte_univ_exist environment variables before calling orte_init() so that ORTE knows how to connect to the daemon, but if the daemon hasn't started yet, orte_init() dies with a bus error in orte_rds_base_close(). The stack trace is below.

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c

Thread 0 Crashed:
0   libopen-rte.0.dylib         0x000c6d59 orte_rds_base_close + 66
1   libopen-rte.0.dylib         0x000a3ba7 orte_system_finalize + 121
2   libopen-rte.0.dylib         0x000d41f9 orte_sds_base_basic_contact_universe + 648
3   libopen-rte.0.dylib         0x000a06ce orte_init_stage1 + 898
4   libopen-rte.0.dylib         0x000a3c0b orte_system_init + 25
5   libopen-rte.0.dylib         0x000a0190 orte_init + 81

A related question: is there any way to check whether the daemon is up other than calling orte_init()? At the moment we just sleep for a few seconds after launching the daemon, which is clearly not a satisfactory solution. I can't find anywhere in the source where such a check is done.
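
To make the question concrete, this is roughly what we would like to replace the fixed sleep with: a bounded polling loop with a growing delay. It is only a sketch. The probe callback is hypothetical and stands in for whatever readiness check turns out to exist; nothing here is an ORTE API, and wait_until_ready() and ready_after_n_failures() are names made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness check supplied by the caller; NOT an ORTE API. */
typedef bool (*probe_fn)(void *ctx);

/* Call probe() until it returns true or max_attempts is used up.
 * The delay between attempts starts at initial_ms and doubles after
 * each failure, capped at max_ms.
 * Returns the number of attempts taken on success, -1 on timeout. */
int wait_until_ready(probe_fn probe, void *ctx,
                     int max_attempts, long initial_ms, long max_ms)
{
    assert(probe != NULL);
    long delay_ms = initial_ms;

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (probe(ctx)) {
            return attempt;
        }
        if (attempt == max_attempts) {
            break;  /* no point sleeping after the final failure */
        }
        struct timespec ts;
        ts.tv_sec  = delay_ms / 1000;
        ts.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);

        delay_ms *= 2;
        if (delay_ms > max_ms) {
            delay_ms = max_ms;
        }
    }
    return -1;
}

/* Stand-in probe for demonstration only: fails while the counter that
 * ctx points to is positive (decrementing it), then succeeds. */
bool ready_after_n_failures(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

The idea is that we would call orte_init() only after wait_until_ready() returns a positive value, and report an error on -1 instead of risking the crash. What is missing is a real probe to plug in.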

Thanks,

Greg
