I spoke with Igor Ivanov about this this morning, and he summarized his findings as follows:
1. Valgrind comes up clean.
2. The issue is not reproduced with a static build.
3. A bisection study reveals that problems first appear after commit:
   https://svn.open-mpi.org/trac/ompi/changeset/28800/trunk/opal/mca/base/mca_base_var.c

Josh

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
Sent: Monday, December 16, 2013 12:15 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] bug in mca framework?

It might be worthwhile to run this through valgrind and see if something is being freed incorrectly...?

On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

> I took a look at the stacktraces last week and could not identify
> where the bug is. I will dig deeper this week and see if I can come up with
> the correct fix.
>
> -Nathan
>
> On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>> Nathan,
>> Could you please comment on Igor's observations?
>> Thanks
>>
>> On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
>> wrote:
>>
>> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
>>
>> On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
>> wrote:
>>
>> It is the first mca variable of type string from btl/openib,
>> 'device_param_files'. Actually, you can disable it and get a failure
>> on the second.
>>
>> Description of the case we see:
>> 1. openib mca variables are registered during startup, as a stage of
>> the component select phase;
>> 2. but the winner is the cm component, and the openib mca variables
>> are deregistered as part of the mca group;
>> 3. the mca variables are not removed from the global mca array, but
>> they are marked as invalid and the memory for the string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib mca variables are registered again using light mode, i.e.
>> looking themselves up in the global array and refreshing their
>> fields;
>>
>> Can you explain what you mean by step 5? I.e., what does "using light
>> mode" mean? Is the openib component register function invoked again?
>>
>> That is correct, it is called twice. "Light mode" means that
>> mca_base_var_register() does not allocate the mca variable object
>> again; it looks the variable up in the global array and, on finding
>> it, updates fields in the mca_base_var_t structure (at least
>> mbv_storage).
>>
>> 6. for an unknown reason, bml finalization does not clean up these
>> vars as is done in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys the mca variables from the global array,
>> encounters openib's variable, and tries to destroy it using an
>> address that is no longer accessible;
>>
>> So the code under discussion fixes step 6.
>>
>> Nathan: it sounds like an MCA var (and entire group) is registered,
>> unregistered, and then registered again. Does the MCA var system get
>> confused here when it tries to unregister the group a 2nd time?
>>
>> The issue probably relates to incorrect recognition of whether a
>> variable is valid or invalid during the second call of
>> mca_base_var_deregister().
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
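
For anyone trying to follow steps 1-8 in Igor's description, here is a rough toy model of the sequence. It is only a sketch under assumptions: ToyVar, ToyRegistry and run() are hypothetical names made up for illustration, not Open MPI APIs, and the model tracks only the valid/invalid flag and which component owns the storage. It does not touch real memory or reproduce what mca_base_var.c actually does.

```python
# Toy model of the register / deregister / re-register / unload / finalize
# sequence from the thread. All names are hypothetical, not Open MPI code.

class ToyVar:
    def __init__(self, name, component):
        self.name = name
        self.component = component  # component whose storage the variable points into
        self.valid = True


class ToyRegistry:
    def __init__(self):
        self.vars = {}       # stands in for the global array; entries are never removed
        self.loaded = set()  # components whose shared object is currently loaded

    def load(self, component):
        self.loaded.add(component)

    def unload(self, component):
        self.loaded.discard(component)

    def register(self, name, component):
        var = self.vars.get(name)
        if var is None:
            # First registration: allocate a new entry.
            self.vars[name] = ToyVar(name, component)
        else:
            # "Light mode": reuse the existing entry and refresh its fields.
            var.component = component
            var.valid = True

    def deregister_group(self, component):
        # The entry stays in the array and is only marked invalid.
        for var in self.vars.values():
            if var.component == component:
                var.valid = False

    def finalize(self):
        # Names of variables that are still valid but whose owning component
        # is gone, i.e. the ones finalize would touch through a stale address.
        return [v.name for v in self.vars.values()
                if v.valid and v.component not in self.loaded]


def run(deregister_before_unload):
    reg = ToyRegistry()
    reg.load("openib")
    reg.register("device_param_files", "openib")  # step 1
    reg.deregister_group("openib")                # steps 2-3
    reg.register("device_param_files", "openib")  # step 5 (after step 4)
    if deregister_before_unload:
        reg.deregister_group("openib")            # the cleanup missing in step 6
    reg.unload("openib")                          # step 7
    return reg.finalize()                         # step 8


print(run(False))  # ['device_param_files']
print(run(True))   # []
```

In this model the variable is left valid at finalize only when the second deregistration is skipped, which matches the claim that fixing step 6 avoids the crash. It says nothing about why the real second deregistration does not happen.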