I looked at the stack traces last week and could not identify where the bug
is. I will dig deeper this week and see if I can come up with the correct fix.

-Nathan

On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>    Nathan,
>    Could you please comment on Igor's observations?
>    Thanks
> 
>    On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
>    wrote:
> 
>      On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> 
>        On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
>        wrote:
> 
>          It is the first mca variable of type string from btl/openib,
>          'device_param_files'. If you disable it, you get the failure on
>          the second one instead.
> 
>          Description of the case we see:
>          1. openib mca variables are registered during startup, at the
>          component selection phase;
>          2. but the winner is the cm component, so the openib mca variables
>          are deregistered as part of their mca group;
>          3. the mca variables are not removed from the global mca array, but
>          they are marked as invalid and the memory for strings is freed;
>          4. shmem needs openib for yoda and does bml initialization;
>          5. openib mca variables are registered again in "light mode": each
>          one is looked up in the global array and its fields are refreshed;
> 
>        Can you explain what you mean by step 5?  I.e., what does "using light
>        mode" mean?  Is the openib component register function invoked again?
> 
>      That is correct, it is called twice. "Light mode" means that
>      mca_base_var_register() does not allocate the mca variable object again;
>      it looks the variable up in the global array and, on finding it, updates
>      fields in the mca_base_var_t structure (at least mbv_storage).
> 
>          6. for an unknown reason, bml finalization does not clean up these
>          variables as is done in step 2;
>          7. mca_btl_openib.so is unloaded;
>          8. opal_finalize() destroys the mca variables from the global array,
>          finds the openib variable, and tries to destroy it through an
>          address that is no longer accessible;
> 
>          So the code under discussion fixes step 6.
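
Steps 1 to 8 can be walked through with a toy model to show why a missing deregistration in step 6 matters. This is a rough Python sketch under stated assumptions, not Open MPI code: `Component`, `VarEntry`, `VarSystem`, and `run` are hypothetical names, and "unloaded" is modelled as a flag rather than real memory being unmapped.

```python
class Component:
    """Hypothetical stand-in for a loadable module such as mca_btl_openib.so."""

    def __init__(self, name):
        self.name = name
        self.loaded = True


class VarEntry:
    """Hypothetical entry in the global variable array."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner  # component whose memory backs the value
        self.valid = True


class VarSystem:
    def __init__(self):
        self.entries = {}

    def register(self, name, owner):
        entry = self.entries.get(name)
        if entry is None:
            self.entries[name] = VarEntry(name, owner)
        else:
            # Re-registration refreshes the existing entry ("light mode").
            entry.owner = owner
            entry.valid = True

    def deregister_group(self, owner):
        # Entries are kept in the array and only marked invalid.
        for entry in self.entries.values():
            if entry.owner is owner:
                entry.valid = False

    def finalize(self):
        """Return names of valid entries backed by an unloaded component."""
        return [e.name for e in self.entries.values()
                if e.valid and not e.owner.loaded]


def run(deregister_on_bml_finalize):
    system = VarSystem()
    openib = Component("openib")
    system.register("device_param_files", openib)  # step 1
    system.deregister_group(openib)                # steps 2-3: cm wins
    system.register("device_param_files", openib)  # steps 4-5: bml init
    if deregister_on_bml_finalize:                 # step 6, with the fix
        system.deregister_group(openib)
    openib.loaded = False                          # step 7: module unloaded
    return system.finalize()                       # step 8


dangling_without_fix = run(deregister_on_bml_finalize=False)
dangling_with_fix = run(deregister_on_bml_finalize=True)
```

Without the step 6 cleanup the model reports `device_param_files` as a valid entry whose backing component is gone, which corresponds to the bad access seen in opal_finalize(); with the cleanup the list is empty.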
> 
>        Nathan: it sounds like an MCA var (and entire group) is registered,
>        unregistered, and then registered again. Does the MCA var system get
>        confused here when it tries to unregister the group a 2nd time?
> 
>      The issue probably relates to incorrect recognition of whether a
>      variable is valid or invalid during the second call of
>      mca_base_var_deregister().
> 
>      _______________________________________________
>      devel mailing list
>      de...@open-mpi.org
>      http://www.open-mpi.org/mailman/listinfo.cgi/devel

