On Feb 3, 2007, at 6:51 AM, Ralph Castain wrote:




On 2/2/07 8:44 AM, "Greg Watson" <gwat...@lanl.gov> wrote:

We're launching a seed daemon so that we can get registry persistence
across multiple job launches. However, there is a race condition
between launching the daemon and the first call to orte_init() that
can result in a bus error. We set the OMPI_MCA_universe and
OMPI_MCA_orte_univ_exist environment variables prior to calling
orte_init() so that orte knows how to connect to the daemon, but if
the daemon hasn't started this causes a bus error in
orte_rds_base_close(). Stack trace below.

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x0000001c

Thread 0 Crashed:
0   libopen-rte.0.dylib  0x000c6d59 orte_rds_base_close + 66
1   libopen-rte.0.dylib  0x000a3ba7 orte_system_finalize + 121
2   libopen-rte.0.dylib  0x000d41f9 orte_sds_base_basic_contact_universe + 648
3   libopen-rte.0.dylib  0x000a06ce orte_init_stage1 + 898
4   libopen-rte.0.dylib  0x000a3c0b orte_system_init + 25
5   libopen-rte.0.dylib  0x000a0190 orte_init + 81
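
One possible caller-side workaround for the race described above is to poll until the daemon is reachable and only then call orte_init(). The thread does not say how daemon readiness can be detected, so the sketch below takes the check as a callback. The names wait_until_ready, ready_probe_fn and countdown_probe are hypothetical and are not part of ORTE; countdown_probe only simulates a daemon that becomes reachable after a few attempts.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the daemon is reachable.
 * How to test that for a real seed daemon is an assumption left to the caller. */
typedef bool (*ready_probe_fn)(void *ctx);

/* Call probe up to max_attempts times, sleeping delay_ms after each failure.
 * Returns true as soon as the probe succeeds, false if every attempt fails. */
bool wait_until_ready(ready_probe_fn probe, void *ctx,
                      int max_attempts, long delay_ms)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (probe(ctx)) {
            return true;
        }
        struct timespec delay;
        delay.tv_sec = delay_ms / 1000;
        delay.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&delay, NULL);
    }
    return false;
}

/* Stand-in probe for illustration only: ctx points to an int holding the
 * number of calls remaining before the simulated daemon is "up". */
bool countdown_probe(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

A launcher would call wait_until_ready() with its own probe after starting the daemon, and call orte_init() only if it returns true. This narrows the window but does not replace a fix inside the library, since orte_init() should fail cleanly rather than crash when the daemon is absent.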


Hmmm...can you tell me which version you are working with? Obviously,
that shouldn't happen. My best initial guess is that rds is being
opened, but hasn't selected components yet when we try to contact the
universe. When that fails and we call finalize, rds tries to "close" a
component list that is NULL. I can look into that.
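
The failure mode guessed at above (a close routine walking a component list that was never set up) can be illustrated with a small guard. This is a sketch under stated assumptions, not the actual ORTE code: component_node_t, component_list_t, selected_components, framework_select and framework_close are hypothetical stand-ins for whatever rds uses internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real ORTE types. */
typedef struct component_node {
    struct component_node *next;
} component_node_t;

typedef struct {
    component_node_t *head;
} component_list_t;

/* Stays NULL after "open" until component selection has run. */
static component_list_t *selected_components = NULL;

/* Release the selected components and return how many were freed.
 * Safe to call when selection never happened: the NULL check returns
 * early instead of dereferencing a missing list. */
int framework_close(void)
{
    if (selected_components == NULL) {
        return 0;  /* opened but never selected: nothing to close */
    }
    int released = 0;
    component_node_t *node = selected_components->head;
    while (node != NULL) {
        component_node_t *next = node->next;
        free(node);
        released++;
        node = next;
    }
    free(selected_components);
    selected_components = NULL;
    return released;
}

/* Build a list of count placeholder components. Returns 0 on success,
 * -1 if an allocation fails (any partial list is released first). */
int framework_select(int count)
{
    assert(selected_components == NULL);  /* select at most once per open */
    component_list_t *list = malloc(sizeof *list);
    if (list == NULL) {
        return -1;
    }
    list->head = NULL;
    selected_components = list;
    for (int i = 0; i < count; i++) {
        component_node_t *node = malloc(sizeof *node);
        if (node == NULL) {
            framework_close();
            return -1;
        }
        node->next = list->head;
        list->head = node;
    }
    return 0;
}
```

With the guard in place, calling framework_close() on the error path before any selection is a no-op rather than a bad access. The crash address 0x0000001c in the trace is consistent with reading a field at a small offset from a NULL pointer, though the thread does not confirm which field.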

1.2b3

Greg

