That is one possibility. The mca_base_var_t in question looks like junk to me.
That should be impossible, since variables are only destructed in
mca_base_var_finalize. My guess is that something is stomping on the variable
memory.

-Nathan

On Mon, Dec 16, 2013 at 05:14:22PM +0000, Jeff Squyres (jsquyres) wrote:
> It might be worthwhile to run this through valgrind and see if something is 
> being freed incorrectly...?
> 
> 
> On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> 
> > I took a look at the stack traces last week and could not identify where the
> > bug is. I will dig deeper this week and see if I can come up with the correct
> > fix.
> > 
> > -Nathan
> > 
> > On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
> >>   Nathan,
> >>   Could you please comment on Igor's observations?
> >>   Thanks
> >> 
> >>   On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
> >>   wrote:
> >> 
> >>     On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> >> 
> >>       On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
> >>       wrote:
> >> 
> >>         It is the first MCA variable of string type from btl/openib,
> >>         'device_param_files'. You can actually disable it and get the
> >>         failure on the second one.
> >> 
> >>         Description of the case we see:
> >>         1. openib MCA variables are registered during startup, at the
> >>         component selection stage;
> >>         2. but the winner is the cm component, so the openib MCA variables
> >>         are deregistered as part of their MCA group;
> >>         3. the MCA variables are not removed from the global MCA array, but
> >>         they are marked as invalid and the memory for the string is freed;
> >>         4. shmem needs openib for yoda and does bml initialization;
> >>         5. openib MCA variables are registered again using light mode, i.e.
> >>         each one is found in the global array and its fields are refreshed;
> >> 
> >>       Can you explain what you mean by step 5?  I.e., what does "using
> >>       light mode" mean?  Is the openib component register function invoked
> >>       again?
> >> 
> >>     That is correct, it is called twice. "Light mode" means that
> >>     mca_base_var_register() does not allocate the MCA variable object again;
> >>     it looks the variable up in the global array and, on finding it, updates
> >>     fields in the mca_base_var_t structure (at least mbv_storage).
> >> 
> >>         6. for an unknown reason, bml finalization does not clean up these
> >>         variables as is done in step 2;
> >>         7. mca_btl_openib.so is unloaded;
> >>         8. opal_finalize() destroys the MCA variables in the global array,
> >>         finds openib's variable, and tries to destroy it using an address
> >>         that is no longer accessible;
> >> 
> >>         So the code under discussion fixes step 6.
> >> 
> >>       Nathan: it sounds like an MCA var (and entire group) is registered,
> >>       unregistered, and then registered again. Does the MCA var system get
> >>       confused here when it tries to unregister the group a 2nd time?
> >> 
> >>     The issue probably relates to incorrect recognition of whether the
> >>     variable is valid or invalid during the second call of
> >>     mca_base_var_deregister().
> >> 
> >>     _______________________________________________
> >>     devel mailing list
> >>     de...@open-mpi.org
> >>     http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 

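For readers following the quoted thread, the eight-step sequence Igor describes can be sketched as a small, self-contained C simulation. This is only an illustration of his hypothesis, not Open MPI code: every name here (toy_var_t, toy_component_t, toy_register, and so on) is hypothetical, the "owner" field exists only so the sketch can detect the hazard instead of crashing, and unloading the shared object is modeled by flipping a flag. Nathan's reply above suggests the real cause may instead be memory corruption, so treat this as one possible reading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a dynamically loaded component (hypothetical). */
typedef struct {
    char *device_param_files;  /* string storage that lives in the component */
    bool loaded;               /* false once the component is "unloaded" */
} toy_component_t;

/* Toy stand-in for a registry entry; NOT the real mca_base_var_t. */
typedef struct {
    const char *name;
    char **storage;            /* points into the owning component */
    toy_component_t *owner;    /* sketch-only: lets finalize detect the hazard */
    bool valid;
} toy_var_t;

#define TOY_MAX_VARS 8
static toy_var_t toy_vars[TOY_MAX_VARS];
static int toy_var_count = 0;

/* Register a variable. A known name reuses its slot ("light mode"). */
static int toy_register(toy_component_t *owner, const char *name, const char *value)
{
    int i;
    for (i = 0; i < toy_var_count; i++) {
        if (strcmp(toy_vars[i].name, name) == 0) break;
    }
    if (i == toy_var_count) {
        if (toy_var_count == TOY_MAX_VARS) return -1;
        toy_vars[i].name = name;
        toy_var_count++;
    }
    char *copy = malloc(strlen(value) + 1);
    if (copy == NULL) return -1;
    strcpy(copy, value);
    free(owner->device_param_files);   /* free(NULL) is a no-op */
    owner->device_param_files = copy;
    toy_vars[i].storage = &owner->device_param_files;
    toy_vars[i].owner = owner;
    toy_vars[i].valid = true;
    return i;
}

/* Deregister: the slot stays in the array but is marked invalid. */
static void toy_deregister(int idx)
{
    toy_var_t *v = &toy_vars[idx];
    if (!v->valid) return;
    free(*v->storage);
    *v->storage = NULL;
    v->storage = NULL;
    v->valid = false;
}

/* Model unloading the component; the sketch frees the string to stay leak-free. */
static void toy_unload(toy_component_t *c)
{
    free(c->device_param_files);
    c->device_param_files = NULL;
    c->loaded = false;
}

/* Finalize: count valid entries whose storage belongs to an unloaded component.
 * Real code would dereference that storage here, which is the suspected crash. */
static int toy_finalize(void)
{
    int dangling = 0;
    for (int i = 0; i < toy_var_count; i++) {
        toy_var_t *v = &toy_vars[i];
        if (v->valid) {
            if (v->owner->loaded) {
                free(*v->storage);
                *v->storage = NULL;
            } else {
                dangling++;
            }
        }
        v->storage = NULL;
        v->valid = false;
    }
    toy_var_count = 0;
    return dangling;
}

/* Runs steps 1-8 and returns the number of dangling entries seen at finalize. */
static int toy_run_sequence(bool deregister_before_unload)
{
    static toy_component_t openib;
    openib.device_param_files = NULL;
    openib.loaded = true;
    toy_var_count = 0;

    int idx = toy_register(&openib, "device_param_files", "params.ini"); /* step 1 */
    assert(idx >= 0);
    toy_deregister(idx);                                                 /* steps 2-3 */
    idx = toy_register(&openib, "device_param_files", "params.ini");     /* steps 4-5 */
    assert(idx >= 0);
    if (deregister_before_unload) toy_deregister(idx);                   /* step 6 cleanup */
    toy_unload(&openib);                                                 /* step 7 */
    return toy_finalize();                                               /* step 8 */
}
```

Calling toy_run_sequence(false) models the failing case, where finalize still sees a valid entry pointing into the unloaded component; toy_run_sequence(true) models the behavior once step 6 deregisters the variables before the unload.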