That is one possibility. The mca_base_var_t in question looks like junk to me. That should be impossible, since variables are only destructed in mca_base_var_finalize. My guess is that something is stomping on the variable memory.

-Nathan

On Mon, Dec 16, 2013 at 05:14:22PM +0000, Jeff Squyres (jsquyres) wrote:
> It might be worthwhile to run this through valgrind and see if something is
> being freed incorrectly...?
>
>
> On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > I took a look at the stacktraces last week and could not identify where
> > the bug is. I will dig deeper this week and see if I can come up with the
> > correct fix.
> >
> > -Nathan
> >
> > On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
> >>    Nathan,
> >>    Could you please comment on Igor's observations?
> >>    Thanks
> >>
> >>    On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
> >>    wrote:
> >>
> >>      On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> >>
> >>        On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
> >>        wrote:
> >>
> >>          It is the first mca variable of type string from btl/openib,
> >>          'device_param_files'. Actually you can disable it and get the
> >>          failure on the second.
> >>
> >>          Description of the case we see:
> >>          1. openib mca variables are registered during startup, at the
> >>          select component phase;
> >>          2. but the winner is the cm component, and the openib mca
> >>          variables are deregistered as part of the mca group;
> >>          3. mca variables are not removed from the global mca array, but
> >>          they are marked as invalid and the memory for the string is
> >>          freed;
> >>          4. shmem needs openib for yoda and does bml initialization;
> >>          5. openib mca variables are registered again using light mode,
> >>          searching for itself in the global array and refreshing their
> >>          fields again;
> >>
> >>        Can you explain what you mean by step 5? I.e., what does "using
> >>        light mode" mean? Is the openib component register function
> >>        invoked again?
> >>
> >>      It is correct, it is called twice. "light mode" means that
> >>      mca_base_var_register() does not allocate the mca variable object
> >>      again; it seeks this variable in the global array and, finding it,
> >>      updates fields in the mca_base_var_t structure (at least
> >>      mbv_storage).
> >>
> >>          6. for an unknown reason bml finalization does not clean these
> >>          vars as is done in step 2;
> >>          7. mca_btl_openib.so is unloaded;
> >>          8. opal_finalize() destroys mca variables from the global array,
> >>          observes openib's variable, and tries to destroy it using an
> >>          inaccessible address;
> >>
> >>          So the code that is under discussion fixes step 6.
> >>
> >>        Nathan: it sounds like an MCA var (and entire group) is
> >>        registered, unregistered, and then registered again. Does the MCA
> >>        var system get confused here when it tries to unregister the group
> >>        a 2nd time?
> >>
> >>      Probably the issue relates to incorrect recognition of whether the
> >>      variable is valid or invalid during the second call of
> >>      mca_base_var_deregister().
> >>
> >>    _______________________________________________
> >>    devel mailing list
> >>    de...@open-mpi.org
> >>    http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
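
For readers following the thread, the register / deregister / re-register sequence described in steps 1-8 above can be modelled with a small toy registry. This is only a sketch under stated assumptions, not Open MPI code: toy_var_t, toy_register(), toy_deregister(), toy_unload_owner(), toy_finalize() and toy_scenario() are all hypothetical names, the "example.ini" value is made up, and the "owner_loaded" flag merely stands in for whether the component's shared object is still mapped. Where the real failure would touch an inaccessible address, the model just counts the affected entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_VARS 8

/* Hypothetical registry entry; NOT the real mca_base_var_t. */
typedef struct {
    const char *name;    /* lookup key */
    char **storage;      /* points into the owning component's memory */
    bool valid;          /* cleared on deregistration (step 3) */
    bool owner_loaded;   /* models whether the owning .so is still mapped */
} toy_var_t;

static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* Portable string copy (strdup is not part of older C standards). */
static char *toy_copy(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) memcpy(p, s, n);
    return p;
}

/* Register a string variable. If the name already exists, reuse the slot
 * and only refresh its fields ("light mode", step 5). */
static int toy_register(const char *name, char **storage, const char *value)
{
    int i;
    for (i = 0; i < toy_var_count; ++i) {
        if (strcmp(toy_vars[i].name, name) == 0) break;
    }
    if (i == toy_var_count) {            /* first registration: new slot */
        if (toy_var_count == TOY_MAX_VARS) return -1;
        toy_vars[i].name = name;
        ++toy_var_count;
    }
    toy_vars[i].storage = storage;
    toy_vars[i].valid = true;
    toy_vars[i].owner_loaded = true;
    *storage = toy_copy(value);
    return i;
}

/* Mark the entry invalid and free its string; the slot stays in the array. */
static void toy_deregister(int i)
{
    if (!toy_vars[i].valid) return;
    free(*toy_vars[i].storage);
    *toy_vars[i].storage = NULL;
    toy_vars[i].valid = false;
}

/* Model unloading the component that owns the storage (step 7). */
static void toy_unload_owner(int i)
{
    toy_vars[i].owner_loaded = false;
}

/* Destroy all valid entries (step 8). Returns how many valid entries still
 * point into an unloaded owner; real code would dereference them. */
static int toy_finalize(void)
{
    int dangling = 0;
    for (int i = 0; i < toy_var_count; ++i) {
        if (!toy_vars[i].valid) continue;
        if (!toy_vars[i].owner_loaded) { ++dangling; continue; }
        free(*toy_vars[i].storage);
        *toy_vars[i].storage = NULL;
    }
    toy_var_count = 0;
    return dangling;
}

/* Walk through steps 1-8; the flag models whether step 6 cleans up. */
static int toy_scenario(bool deregister_after_bml)
{
    static char *param_files = NULL;     /* lives in the "component" */
    int first = toy_register("device_param_files", &param_files, "example.ini");
    toy_deregister(first);                               /* steps 2-3 */
    int again = toy_register("device_param_files", &param_files, "example.ini");
    assert(first >= 0 && again == first);                /* same slot reused */
    if (deregister_after_bml) toy_deregister(again);     /* step 6, fixed */
    toy_unload_owner(again);                             /* step 7 */
    int dangling = toy_finalize();                       /* step 8 */
    free(param_files);                                   /* toy cleanup only */
    param_files = NULL;
    return dangling;
}
```

Calling toy_scenario(false) models the reported behaviour, where the re-registered variable is still marked valid when its owner is unloaded; toy_scenario(true) models the cleanup discussed for step 6. Whether this matches the actual cause is still open, given the suspicion above that something may be stomping on the variable memory.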