Re: [OMPI devel] bug in mca framework?

2013-12-17 Thread Joshua Ladd
I believe Devendar Bureddy nailed the root cause. I am providing his excellent 
analysis below:

From Devendar:

Out of curiosity I looked at this issue; here are my 2 cents.
I think the issue arises because the BTL components are opened twice (once from 
ompi_mpi_init, once from yoda), which leads to incorrect usage of var groups. The 
following sequence of events produces the invalid memory:

1) All openib component parameters are registered in ompi_mpi_init:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> ompi_mpi_init -> 
mca_base_framework_open -> mca_pml_base_open -> ... -> mca_bml_base_open -> ... -> 
btl_openib_component_register()

*   For every string variable, a memory block is allocated (var->mbv_storage 
= PTR).

At this point a new var group, id 114 (parent group id 112), is created for 
all openib component variables.

2) This var group is deregistered in ompi_mpi_init. Deregistration marks all of 
its variables as invalid, but the group itself still exists:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> mca_pml_base_select -> 
mca_base_components_close -> ... -> mca_bml_base_close -> 
mca_base_framework_close -> mca_base_var_group_deregister(groupid: 114)

*   The memory of all string variables is deallocated (var->mbv_storage = NULL).

3) Because of step 2, the btl_openib.so shared library is dlclose()d.

4) Now openib is reopened in yoda and the openib variables are registered 
again:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> _shmem_init -> 
mca_base_framework_open -> mca_spml_base_open -> mca_spml_yoda_component_open -> 
... -> mca_bml_base_open -> ... -> btl_openib_component_register -> 
register_variables()

*   In register_variables(), var_find() finds each variable (in the same 
old group, 114) and resets it.
*   For string variables, the buffers are allocated again 
(var->mbv_storage = PTR).
*   Note that group 114 does not belong to the yoda component.

5) When the yoda component closes, it never finds the above group (114), because 
that group does not belong to this component. So mca_base_var_group_deregister() 
is not called again on the var group, and the string variable memory is not 
deallocated:
main -> start_pes -> shmem_init -> oshmem_shmem_init -> _shmem_init -> 
mca_spml_base_select -> ... -> mca_spml_yoda_component_close -> mca_bml_base_close 
-> mca_base_var_group_find()

6) Following step 5, btl_openib.so is dlclose()d. This invalidates all openib 
string variable memory (var->mbv_storage = PTR) allocated in step 4.

7) ompi_mpi_finalize() loops through all variables, finalizes them, and 
deallocates the string variable memory (var->mbv_storage = PTR):
ompi_mpi_finalize -> ... -> mca_base_var_finalize

*   var->mbv_storage = PTR is invalid at this stage, which causes the SEGFAULT.
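
The seven steps above can be replayed with a small toy model. This is only a sketch under stated assumptions: every name in it (lc_var_t, lc_register, lc_close, and so on) is hypothetical and none of it is the Open MPI API. It tracks a single string variable with flags instead of real pointers, so it can report the stale access without actually crashing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one openib string variable; NOT Open MPI code. */
enum { LC_GROUP_OPENIB = 114 };

typedef struct {
    bool registered;      /* a slot exists in the global variable array     */
    bool valid;           /* cleared when the owning group is deregistered  */
    bool storage_set;     /* models var->mbv_storage = PTR (true) vs. NULL  */
    bool storage_mapped;  /* false once the owning .so has been unloaded    */
    int  group;           /* group the slot was first created in            */
} lc_var_t;

lc_var_t lc_openib_var;   /* zero-initialized: nothing registered yet */

/* Steps 1 and 4: create the slot, or refresh the slot that already exists. */
void lc_register(void)
{
    if (!lc_openib_var.registered) {
        lc_openib_var.registered = true;
        lc_openib_var.group = LC_GROUP_OPENIB;
    }
    lc_openib_var.valid = true;
    lc_openib_var.storage_set = true;
    lc_openib_var.storage_mapped = true;
}

/* Steps 2-3 and 5-6: the closing side deregisters only a group it can find,
 * but the shared library is unloaded in either case. */
void lc_close(int group_found_by_closer)
{
    if (group_found_by_closer == lc_openib_var.group) {
        lc_openib_var.valid = false;
        lc_openib_var.storage_set = false;
    }
    lc_openib_var.storage_mapped = false;
}

/* Step 7: finalize touches the storage of every variable that still has it
 * set. Returns true if that access would land in unmapped memory. */
bool lc_finalize_hits_stale_storage(void)
{
    return lc_openib_var.storage_set && !lc_openib_var.storage_mapped;
}

/* Replays steps 1-7 and reports whether finalize would touch stale storage. */
bool lc_replay_double_open(void)
{
    lc_register();              /* 1: ompi_mpi_init path                 */
    lc_close(LC_GROUP_OPENIB);  /* 2-3: group 114 found and deregistered */
    lc_register();              /* 4: yoda path reuses the old slot      */
    lc_close(-1);               /* 5-6: yoda does not find group 114     */
    return lc_finalize_hits_stale_storage();  /* 7 */
}
```

In this model a single register/close pair leaves nothing for finalize to touch; only the second, unmatched registration leaves storage set after the library is gone.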


This also explains why Dinar's patch, kostul_fix.patch 
(http://bgate.mellanox.com/redmine/attachments/1643/kostul_fix.patch), resolves 
the issue: it prevents the lookup from finding the invalid, previously registered 
params.

I also see that the signature of many of these registration functions has an 
entry for the project name, but at present NULL is always passed. I see a note by 
Nathan in

../opal/mca/base/mca_base_var.c +1311
{
/* XXX -- component_update -- We will stash the project name in the component */
return mca_base_var_register (NULL, component->mca_type_name,


It seems that knowing the project name, oshmem, would allow us to distinguish 
between the different BMLs.
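
To illustrate the idea, here is a minimal sketch of how a project prefix could keep two registrations of the same component variable apart. The helper name_build_full() and the "project_framework_component_variable" naming scheme are assumptions made for this example only; they are not taken from mca_base_var.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds "project_framework_component_variable".
 * When project is NULL (as the thread says is always passed today), the
 * prefix is omitted, so every project produces the same lookup key. */
int name_build_full(char *buf, size_t len, const char *project,
                    const char *framework, const char *component,
                    const char *variable)
{
    if (NULL == project) {
        return snprintf(buf, len, "%s_%s_%s", framework, component, variable);
    }
    return snprintf(buf, len, "%s_%s_%s_%s",
                    project, framework, component, variable);
}
```

With NULL as the project, the ompi and oshmem openings of btl/openib both map to the key btl_openib_device_param_files; with distinct project names the keys differ, so a lookup from one side could no longer find the other side's stale entry.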

Nathan, please advise.

Josh


-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Monday, December 16, 2013 12:44 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] bug in mca framework?

On Mon, Dec 16, 2013 at 05:21:05PM +, Joshua Ladd wrote:
> After speaking with Igor Ivanov about this this morning, he summarized his 
> findings as follows:
> 
> 1. Valgrind comes up clean.

That's good to hear, but unfortunate, since this really seems like a 
stomping-on-memory problem.

> 2. The issue is not reproduced with a static build.

This is a red herring. The variable itself contains garbage. The mbv_storage 
pointer looked like it was on the stack, the name was not valid, etc. Not sure 
how we got an mca_base_var_t into that state, since the only time we touch 
anything in them is in mca_base_var_finalize. That function cleans up all of 
the state, so two calls to it should be harmless.

> 3. A bisection study reveals that problems first appear after commit: 
> https://svn.open-mpi.org/trac/ompi/changeset/28800/trunk/opal/mca/base
> /mca_base_var.c

Possibly also a coincidence. That commit only 1) moves the group stuff into its 
own file, and 2) adds the mca_base_pvar interface. It's possible I messed 
something up in the rest of the code, but unlikely. I will take another look 
though.

-Nathan

> 
> 
> Josh
> 
> -Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jef

Re: [OMPI devel] bug in mca framework?

2013-12-16 Thread Nathan Hjelm
On Mon, Dec 16, 2013 at 05:21:05PM +, Joshua Ladd wrote:
> After speaking with Igor Ivanov about this this morning, he summarized his 
> findings as follows:
> 
> 1. Valgrind comes up clean.

That's good to hear, but unfortunate, since this really seems like a
stomping-on-memory problem.

> 2. The issue is not reproduced with a static build.

This is a red herring. The variable itself contains garbage. The
mbv_storage pointer looked like it was on the stack, the name was not
valid, etc. Not sure how we got an mca_base_var_t into that state, since
the only time we touch anything in them is in
mca_base_var_finalize. That function cleans up all of the state, so two
calls to it should be harmless.
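
The "two calls should be harmless" property can be sketched as follows. This is a generic illustration of an idempotent finalize, assuming a hypothetical fin_state_t with a single heap-owned string; it is not the actual body of mca_base_var_finalize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state with one heap-owned string; NOT Open MPI code. */
typedef struct {
    char *value;        /* owned by this state, or NULL */
    bool  initialized;
} fin_state_t;

/* Copies text into freshly allocated memory owned by the state. */
int fin_init(fin_state_t *s, const char *text)
{
    size_t n = strlen(text) + 1;
    s->value = malloc(n);
    if (NULL == s->value) {
        return -1;
    }
    memcpy(s->value, text, n);
    s->initialized = true;
    return 0;
}

/* Releases everything and resets the state, so a second call is a no-op. */
void fin_finalize(fin_state_t *s)
{
    if (!s->initialized) {
        return;
    }
    free(s->value);
    s->value = NULL;
    s->initialized = false;
}
```

The guard only protects against repeated calls. It does not help when the memory a pointer refers to has been invalidated by someone else between initialization and the first finalize.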

> 3. A bisection study reveals that problems first appear after commit: 
> https://svn.open-mpi.org/trac/ompi/changeset/28800/trunk/opal/mca/base/mca_base_var.c

Possibly also a coincidence. That commit only 1) moves the group stuff
into its own file, and 2) adds the mca_base_pvar interface. It's possible
I messed something up in the rest of the code, but unlikely. I will take
another look though.

-Nathan

> 
> 
> Josh
> 
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
> (jsquyres)
> Sent: Monday, December 16, 2013 12:15 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] bug in mca framework?
> 
> It might be worthwhile to run this through valgrind and see if something is 
> being freed incorrectly...?
> 
> 
> On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> 
> > I took a look at the stacktraces last week and could not identify 
> > where the bug is. I will dig deeper this week and see if I can come up with 
> > the correct fix.
> > 
> > -Nathan
> > 
> > On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
> >>   Nathan,
> >>   Could you please comment on the Igor`s observations?
> >>   Thanks
> >> 
> >>   On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
> >>   wrote:
> >> 
> >> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> >> 
> >>   On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
> >>   wrote:
> >> 
> >> It is the first mca variable with type as string from btl/openib as
> >> 'device_param_files'. Actually you can disable it and get failure 
> >> on
> >> the second.
> >> 
> >> Description of case we see:
> >> 1. openib mca variables are registered during startup as stage at
> >> select component phase;
> >> 2. but a winner is cm component and openib mca variables are
> >> deregistered as part of mca group;
> >> 3. mca variables are not removed from global mca array but they
> >> marked as invalid and memory for string is freed;
> >> 4. shmem needs openib for yoda and does bml initialization;
> >> 5. openib mca variables are registered againusing light mode as
> >> searching itself in global array and refreshing their fields 
> >> again;
> >> 
> >>   Can you explain what you mean by step 5?  I.e., what does "using 
> >> light
> >>   mode" mean?  Is the openib component register function invoked again?
> >> 
> >> It is correct, it is called twice. "light mode" means that
> >> mca_base_var_register() does not allocate mca variable object again, it
> >> seeks this variable in global array and finding it updates fields in
> >> mca_base_var_t structure (at least mbv_storage).
> >> 
> >> 6. for unknown reason bml finalization does not clean these vars as
> >> it is done in step 2;
> >> 7. mca_btl_openib.so is unloaded;
> >> 8. opal_finalize() destroys mca variables form global array,
> >> observes openib`s variable, try destroy using non accessed 
> >> address;
> >> 
> >> So a code that is under discussion fixes step 6.
> >> 
> >>   Nathan: it sounds like an MCA var (and entire group) is registered,
> >>   unregistered, and then registered again. Does the MCA var system get
> >>   confused here when it tries to unregister the group a 2nd time?
> >> 
> >> Probably issue relates incorrect recognition if variable valid/invalid
> >> during second call of mca_base_var_deregister().
> >> 
> >> ___
> 

Re: [OMPI devel] bug in mca framework?

2013-12-16 Thread Nathan Hjelm
That is one possibility. The mca_base_var_t in question looks like junk to me. 
That should be impossible, since variables are only destructed in 
mca_base_var_finalize. My guess is that something is stomping on the variable 
memory.

-Nathan

On Mon, Dec 16, 2013 at 05:14:22PM +, Jeff Squyres (jsquyres) wrote:
> It might be worthwhile to run this through valgrind and see if something is 
> being freed incorrectly...?
> 
> 
> On Dec 16, 2013, at 12:11 PM, Nathan Hjelm  wrote:
> 
> > I took a look at the stacktraces last week and could not identify where the 
> > bug
> > is. I will dig deeper this week and see if I can come up with the correct 
> > fix.
> > 
> > -Nathan
> > 
> > On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
> >>   Nathan,
> >>   Could you please comment on the Igor`s observations?
> >>   Thanks
> >> 
> >>   On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov 
> >>   wrote:
> >> 
> >> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> >> 
> >>   On Dec 4, 2013, at 2:52 AM, Igor Ivanov 
> >>   wrote:
> >> 
> >> It is the first mca variable with type as string from btl/openib as
> >> 'device_param_files'. Actually you can disable it and get failure 
> >> on
> >> the second.
> >> 
> >> Description of case we see:
> >> 1. openib mca variables are registered during startup as stage at
> >> select component phase;
> >> 2. but a winner is cm component and openib mca variables are
> >> deregistered as part of mca group;
> >> 3. mca variables are not removed from global mca array but they
> >> marked as invalid and memory for string is freed;
> >> 4. shmem needs openib for yoda and does bml initialization;
> >> 5. openib mca variables are registered againusing light mode as
> >> searching itself in global array and refreshing their fields again;
> >> 
> >>   Can you explain what you mean by step 5?  I.e., what does "using 
> >> light
> >>   mode" mean?  Is the openib component register function invoked again?
> >> 
> >> It is correct, it is called twice. "light mode" means that
> >> mca_base_var_register() does not allocate mca variable object again, it
> >> seeks this variable in global array and finding it updates fields in
> >> mca_base_var_t structure (at least mbv_storage).
> >> 
> >> 6. for unknown reason bml finalization does not clean these vars as
> >> it is done in step 2;
> >> 7. mca_btl_openib.so is unloaded;
> >> 8. opal_finalize() destroys mca variables form global array,
> >> observes openib`s variable, try destroy using non accessed address;
> >> 
> >> So a code that is under discussion fixes step 6.
> >> 
> >>   Nathan: it sounds like an MCA var (and entire group) is registered,
> >>   unregistered, and then registered again. Does the MCA var system get
> >>   confused here when it tries to unregister the group a 2nd time?
> >> 
> >> Probably issue relates incorrect recognition if variable valid/invalid
> >> during second call of mca_base_var_deregister().
> >> 
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


Re: [OMPI devel] bug in mca framework?

2013-12-16 Thread Joshua Ladd
After speaking with Igor Ivanov about this this morning, he summarized his 
findings as follows:

1. Valgrind comes up clean.
2. The issue is not reproduced with a static build.
3. A bisection study reveals that problems first appear after commit: 
https://svn.open-mpi.org/trac/ompi/changeset/28800/trunk/opal/mca/base/mca_base_var.c


Josh

-----Original Message-----
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Monday, December 16, 2013 12:15 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] bug in mca framework?

It might be worthwhile to run this through valgrind and see if something is 
being freed incorrectly...?


On Dec 16, 2013, at 12:11 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

> I took a look at the stacktraces last week and could not identify 
> where the bug is. I will dig deeper this week and see if I can come up with 
> the correct fix.
> 
> -Nathan
> 
> On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>>   Nathan,
>>   Could you please comment on the Igor`s observations?
>>   Thanks
>> 
>>   On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov <igor.iva...@itseez.com>
>>   wrote:
>> 
>> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
>> 
>>   On Dec 4, 2013, at 2:52 AM, Igor Ivanov <igor.iva...@itseez.com>
>>   wrote:
>> 
>> It is the first mca variable with type as string from btl/openib as
>> 'device_param_files'. Actually you can disable it and get failure on
>> the second.
>> 
>> Description of case we see:
>> 1. openib mca variables are registered during startup as stage at
>> select component phase;
>> 2. but a winner is cm component and openib mca variables are
>> deregistered as part of mca group;
>> 3. mca variables are not removed from global mca array but they
>> marked as invalid and memory for string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib mca variables are registered againusing light mode as
>> searching itself in global array and refreshing their fields 
>> again;
>> 
>>   Can you explain what you mean by step 5?  I.e., what does "using light
>>   mode" mean?  Is the openib component register function invoked again?
>> 
>> It is correct, it is called twice. "light mode" means that
>> mca_base_var_register() does not allocate mca variable object again, it
>> seeks this variable in global array and finding it updates fields in
>> mca_base_var_t structure (at least mbv_storage).
>> 
>> 6. for unknown reason bml finalization does not clean these vars as
>> it is done in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys mca variables form global array,
>> observes openib`s variable, try destroy using non accessed 
>> address;
>> 
>> So a code that is under discussion fixes step 6.
>> 
>>   Nathan: it sounds like an MCA var (and entire group) is registered,
>>   unregistered, and then registered again. Does the MCA var system get
>>   confused here when it tries to unregister the group a 2nd time?
>> 
>> Probably issue relates incorrect recognition if variable valid/invalid
>> during second call of mca_base_var_deregister().
>> 


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] bug in mca framework?

2013-12-16 Thread Jeff Squyres (jsquyres)
It might be worthwhile to run this through valgrind and see if something is 
being freed incorrectly...?


On Dec 16, 2013, at 12:11 PM, Nathan Hjelm  wrote:

> I took a look at the stacktraces last week and could not identify where the 
> bug
> is. I will dig deeper this week and see if I can come up with the correct fix.
> 
> -Nathan
> 
> On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>>   Nathan,
>>   Could you please comment on the Igor`s observations?
>>   Thanks
>> 
>>   On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov 
>>   wrote:
>> 
>> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
>> 
>>   On Dec 4, 2013, at 2:52 AM, Igor Ivanov 
>>   wrote:
>> 
>> It is the first mca variable with type as string from btl/openib as
>> 'device_param_files'. Actually you can disable it and get failure on
>> the second.
>> 
>> Description of case we see:
>> 1. openib mca variables are registered during startup as stage at
>> select component phase;
>> 2. but a winner is cm component and openib mca variables are
>> deregistered as part of mca group;
>> 3. mca variables are not removed from global mca array but they
>> marked as invalid and memory for string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib mca variables are registered againusing light mode as
>> searching itself in global array and refreshing their fields again;
>> 
>>   Can you explain what you mean by step 5?  I.e., what does "using light
>>   mode" mean?  Is the openib component register function invoked again?
>> 
>> It is correct, it is called twice. "light mode" means that
>> mca_base_var_register() does not allocate mca variable object again, it
>> seeks this variable in global array and finding it updates fields in
>> mca_base_var_t structure (at least mbv_storage).
>> 
>> 6. for unknown reason bml finalization does not clean these vars as
>> it is done in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys mca variables form global array,
>> observes openib`s variable, try destroy using non accessed address;
>> 
>> So a code that is under discussion fixes step 6.
>> 
>>   Nathan: it sounds like an MCA var (and entire group) is registered,
>>   unregistered, and then registered again. Does the MCA var system get
>>   confused here when it tries to unregister the group a 2nd time?
>> 
>> Probably issue relates incorrect recognition if variable valid/invalid
>> during second call of mca_base_var_deregister().
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] bug in mca framework?

2013-12-16 Thread Nathan Hjelm
I took a look at the stacktraces last week and could not identify where the bug
is. I will dig deeper this week and see if I can come up with the correct fix.

-Nathan

On Mon, Dec 09, 2013 at 03:17:36PM +0200, Mike Dubman wrote:
>Nathan,
>Could you please comment on the Igor`s observations?
>Thanks
> 
>On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov 
>wrote:
> 
>  On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
> 
>On Dec 4, 2013, at 2:52 AM, Igor Ivanov 
>wrote:
> 
>  It is the first mca variable with type as string from btl/openib as
>  'device_param_files'. Actually you can disable it and get failure on
>  the second.
> 
>  Description of case we see:
>  1. openib mca variables are registered during startup as stage at
>  select component phase;
>  2. but a winner is cm component and openib mca variables are
>  deregistered as part of mca group;
>  3. mca variables are not removed from global mca array but they
>  marked as invalid and memory for string is freed;
>  4. shmem needs openib for yoda and does bml initialization;
>  5. openib mca variables are registered againusing light mode as
>  searching itself in global array and refreshing their fields again;
> 
>Can you explain what you mean by step 5?  I.e., what does "using light
>mode" mean?  Is the openib component register function invoked again?
> 
>  It is correct, it is called twice. "light mode" means that
>  mca_base_var_register() does not allocate mca variable object again, it
>  seeks this variable in global array and finding it updates fields in
>  mca_base_var_t structure (at least mbv_storage).
> 
>  6. for unknown reason bml finalization does not clean these vars as
>  it is done in step 2;
>  7. mca_btl_openib.so is unloaded;
>  8. opal_finalize() destroys mca variables form global array,
>  observes openib`s variable, try destroy using non accessed address;
> 
>  So a code that is under discussion fixes step 6.
> 
>Nathan: it sounds like an MCA var (and entire group) is registered,
>unregistered, and then registered again. Does the MCA var system get
>confused here when it tries to unregister the group a 2nd time?
> 
>  Probably issue relates incorrect recognition if variable valid/invalid
>  during second call of mca_base_var_deregister().
> 


Re: [OMPI devel] bug in mca framework?

2013-12-09 Thread Mike Dubman
Nathan,
Could you please comment on Igor's observations?

Thanks


On Wed, Dec 4, 2013 at 4:44 PM, Igor Ivanov  wrote:

> On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:
>
>> On Dec 4, 2013, at 2:52 AM, Igor Ivanov  wrote:
>>
>>  It is the first mca variable with type as string from btl/openib as
>>> 'device_param_files'. Actually you can disable it and get failure on the
>>> second.
>>>
>>> Description of case we see:
>>> 1. openib mca variables are registered during startup as stage at select
>>> component phase;
>>> 2. but a winner is cm component and openib mca variables are
>>> deregistered as part of mca group;
>>> 3. mca variables are not removed from global mca array but they marked
>>> as invalid and memory for string is freed;
>>> 4. shmem needs openib for yoda and does bml initialization;
>>> 5. openib mca variables are registered againusing light mode as
>>> searching itself in global array and refreshing their fields again;
>>>
>> Can you explain what you mean by step 5?  I.e., what does "using light
>> mode" mean?  Is the openib component register function invoked again?
>>
> It is correct, it is called twice. "light mode" means that
> mca_base_var_register() does not allocate mca variable object again, it
> seeks this variable in global array and finding it updates fields in
> mca_base_var_t structure (at least mbv_storage).
>
>
>>  6. for unknown reason bml finalization does not clean these vars as it
>>> is done in step 2;
>>> 7. mca_btl_openib.so is unloaded;
>>> 8. opal_finalize() destroys mca variables form global array, observes
>>> openib`s variable, try destroy using non accessed address;
>>>
>>> So a code that is under discussion fixes step 6.
>>>
>> Nathan: it sounds like an MCA var (and entire group) is registered,
>> unregistered, and then registered again. Does the MCA var system get
>> confused here when it tries to unregister the group a 2nd time?
>>
> Probably issue relates incorrect recognition if variable valid/invalid
> during second call of mca_base_var_deregister().
>
>


Re: [OMPI devel] bug in mca framework?

2013-12-04 Thread Igor Ivanov

On 04.12.2013 17:56, Jeff Squyres (jsquyres) wrote:

> On Dec 4, 2013, at 2:52 AM, Igor Ivanov  wrote:
>
>> It is the first mca variable with type as string from btl/openib as 
>> 'device_param_files'. Actually you can disable it and get failure on the second.
>>
>> Description of case we see:
>> 1. openib mca variables are registered during startup as stage at select 
>> component phase;
>> 2. but a winner is cm component and openib mca variables are deregistered as 
>> part of mca group;
>> 3. mca variables are not removed from global mca array but they marked as 
>> invalid and memory for string is freed;
>> 4. shmem needs openib for yoda and does bml initialization;
>> 5. openib mca variables are registered again using light mode as searching 
>> itself in global array and refreshing their fields again;
>
> Can you explain what you mean by step 5?  I.e., what does "using light mode" 
> mean?  Is the openib component register function invoked again?

That is correct, it is called twice. "Light mode" means that 
mca_base_var_register() does not allocate the mca variable object again; it 
looks the variable up in the global array and, on finding it, updates fields in 
the mca_base_var_t structure (at least mbv_storage).
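
A minimal sketch of this register-or-refresh behaviour follows, together with the effect of the invalidok flag that the proposed patch removes from the lookup. The registry below (reg_var_t, reg_find, reg_register, reg_deregister) is a hypothetical toy written for this illustration; the real lookup in mca_base_var.c is hash-based and differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define REG_MAX_VARS 8

/* Hypothetical toy registry; NOT the Open MPI implementation. */
typedef struct {
    const char *name;
    void       *storage;  /* plays the role of mbv_storage */
    bool        valid;
} reg_var_t;

reg_var_t reg_vars[REG_MAX_VARS];
int       reg_var_count;

/* Returns the slot index for name, or -1. Invalid (deregistered) slots are
 * returned only when invalidok is true. */
int reg_find(const char *name, bool invalidok)
{
    for (int i = 0; i < reg_var_count; ++i) {
        if (0 == strcmp(reg_vars[i].name, name) &&
            (invalidok || reg_vars[i].valid)) {
            return i;
        }
    }
    return -1;
}

/* "Light mode": if a slot is found it is refreshed in place; otherwise a
 * new slot is appended. Returns the slot index, or -1 if the array is full. */
int reg_register(const char *name, void *storage, bool invalidok)
{
    int i = reg_find(name, invalidok);
    if (i < 0) {
        if (REG_MAX_VARS == reg_var_count) {
            return -1;
        }
        i = reg_var_count++;
        reg_vars[i].name = name;
    }
    reg_vars[i].storage = storage;
    reg_vars[i].valid = true;
    return i;
}

/* Marks the slot invalid and drops its storage; the slot itself remains. */
void reg_deregister(int i)
{
    reg_vars[i].valid = false;
    reg_vars[i].storage = NULL;
}
```

With invalidok set, a second registration lands in the slot left behind by the first one, and so inherits whatever group that slot was created in. Without it, the stale slot is skipped and a fresh one is created.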



>> 6. for unknown reason bml finalization does not clean these vars as it is done 
>> in step 2;
>> 7. mca_btl_openib.so is unloaded;
>> 8. opal_finalize() destroys mca variables form global array, observes openib`s 
>> variable, try destroy using non accessed address;
>>
>> So a code that is under discussion fixes step 6.
>
> Nathan: it sounds like an MCA var (and entire group) is registered, 
> unregistered, and then registered again. Does the MCA var system get confused 
> here when it tries to unregister the group a 2nd time?

The issue probably relates to incorrect recognition of whether a variable is 
valid or invalid during the second call of mca_base_var_deregister().




Re: [OMPI devel] bug in mca framework?

2013-12-04 Thread Jeff Squyres (jsquyres)
On Dec 4, 2013, at 2:52 AM, Igor Ivanov  wrote:

> It is the first mca variable with type as string from btl/openib as 
> 'device_param_files'. Actually you can disable it and get failure on the 
> second.
> 
> Description of case we see:
> 1. openib mca variables are registered during startup as stage at select 
> component phase;
> 2. but a winner is cm component and openib mca variables are deregistered as 
> part of mca group;
> 3. mca variables are not removed from global mca array but they marked as 
> invalid and memory for string is freed;
> 4. shmem needs openib for yoda and does bml initialization;
> 5. openib mca variables are registered againusing light mode as searching 
> itself in global array and refreshing their fields again;

Can you explain what you mean by step 5?  I.e., what does "using light mode" 
mean?  Is the openib component register function invoked again?

> 6. for unknown reason bml finalization does not clean these vars as it is 
> done in step 2;
> 7. mca_btl_openib.so is unloaded;
> 8. opal_finalize() destroys mca variables form global array, observes 
> openib`s variable, try destroy using non accessed address;
> 
> So a code that is under discussion fixes step 6.

Nathan: it sounds like an MCA var (and entire group) is registered, 
unregistered, and then registered again. Does the MCA var system get confused 
here when it tries to unregister the group a 2nd time?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] bug in mca framework?

2013-12-04 Thread Igor Ivanov
It is the first mca variable of type string from btl/openib, 
'device_param_files'. Actually, you can disable it and get the failure on the 
second one.


Description of the case we see:
1. openib mca variables are registered during startup, at the component 
selection phase;
2. but the winner is the cm component, and the openib mca variables are 
deregistered as part of their mca group;
3. the mca variables are not removed from the global mca array, but they are 
marked as invalid and the memory for strings is freed;
4. shmem needs openib for yoda and does bml initialization;
5. openib mca variables are registered again using "light mode", i.e. by 
finding themselves in the global array and refreshing their fields;
6. for an unknown reason, bml finalization does not clean up these vars as 
is done in step 2;
7. mca_btl_openib.so is unloaded;
8. opal_finalize() destroys the mca variables in the global array, encounters 
the openib variable, and tries to destroy it using an inaccessible address;


So the code under discussion fixes step 6.

Igor

On 03.12.2013 23:01, Jeff Squyres (jsquyres) wrote:

I don't think there is one -- you'll need to print it from the debugger.


On Dec 3, 2013, at 1:38 PM, Mike Dubman  wrote:


thanks
what magic "-mca base_verbose" param should print it?


On Tue, Dec 3, 2013 at 6:59 PM, Nathan Hjelm  wrote:
This usually happens when a string that belongs to the MCA system is freed
elsewhere. Can you find out the name of the variable that is being destructed
in frame 2.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Dec 03, 2013 at 02:53:29PM +0200, Mike Dubman wrote:

Hi,
We observe a crash during shmem_finalize() (in trunk) with the new MCA
framework.
After investigation, we found that the MCA tear-down process can access
previously released memory (reproduced with the oshmem_hello_c.c test):
#0 0x7fffed3d51d0 in ?? ()
#1 
#2 0x7710e21e in var_destructor (var=0x6fa7e0) at mca_base_var.c:1605
#3 0x7710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at ../../../opal/class/opal_object.h:448
#4 0x7710ca18 in mca_base_var_finalize () at mca_base_var.c:954
#5 0x7710a7e2 in mca_base_param_finalize () at mca_base_param.c:643
#6 0x770e08e2 in opal_finalize_util () at runtime/opal_finalize.c:77
#7 0x77aa5319 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:407
#8 0x77d900cc in oshmem_shmem_finalize () at runtime/oshmem_shmem_finalize.c:75
#9 0x77d91119 in shmem_finalize () at shmem_finalize.c:24
#10 0x77d89b8f in __do_global_dtors_aux () from /install/lib/libshmem.so.0
#11 0x in ?? ()
The crash can be resolved by the following patch:
diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 9966627..48028d8 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -773,7 +773,7 @@ static int var_find_by_name (const char *full_name, int *index, bool invalidok)
 
     (void) var_get ((int)(uintptr_t) tmp, &var, false);
 
-    if (invalidok || VAR_IS_VALID(var[0])) {
+    if (VAR_IS_VALID(var[0])) {
         *index = (int)(uintptr_t) tmp;
         return OPAL_SUCCESS;
     }
I'm not sure we understand yet why it fixes the problem or where the
race is.
Could someone with knowledge of the MCA flows look at it and comment?
The "invalidok" flag was introduced by Jeff's commit.
Thanks
M


Re: [OMPI devel] bug in mca framework?

2013-12-03 Thread Mike Dubman
thanks
what magic "-mca base_verbose" param should print it?


On Tue, Dec 3, 2013 at 6:59 PM, Nathan Hjelm  wrote:

> This usually happens when a string that belongs to the MCA system is freed
> elsewhere. Can you find out the name of the variable that is being
> destructed
> in frame 2.
>
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
>
> On Tue, Dec 03, 2013 at 02:53:29PM +0200, Mike Dubman wrote:
> >Hi,
> >We observe crash during shmem_finalize()  (in trunk) with new MCA
> >framework.
> >After investigation, found that  MCA tears-down process can access
> >previously released memory. (reproduced with oshmem_hello_c.c test)
> >0 0x7fffed3d51d0 in ?? ()
> >#1 
> >#2 0x7710e21e in var_destructor (var=0x6fa7e0) at
> >mca_base_var.c:1605
> >#3 0x7710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at
> >../../../opal/class/opal_object.h:448
> >#4 0x7710ca18 in mca_base_var_finalize () at
> mca_base_var.c:954
> >#5 0x7710a7e2 in mca_base_param_finalize () at
> >mca_base_param.c:643
> >#6 0x770e08e2 in opal_finalize_util () at
> >runtime/opal_finalize.c:77
> >#7 0x77aa5319 in ompi_mpi_finalize () at
> >runtime/ompi_mpi_finalize.c:407
> >#8 0x77d900cc in oshmem_shmem_finalize () at
> >runtime/oshmem_shmem_finalize.c:75
> >#9 0x77d91119 in shmem_finalize () at shmem_finalize.c:24
> >#10 0x77d89b8f in __do_global_dtors_aux () from
> >/install/lib/libshmem.so.0
> >#11 0x in ?? ()
> >The crash can be resolved by following patch:
> >diff --git a/opal/mca/base/mca_base_var.c
> b/opal/mca/base/mca_base_var.c
> >index 9966627..48028d8 100644
> >--- a/opal/mca/base/mca_base_var.c
> >+++ b/opal/mca/base/mca_base_var.c
> >@@ -773,7 +773,7 @@ static int var_find_by_name (const char
> *full_name,
> >int *index, bool invalidok)
> >
> > (void) var_get ((int)(uintptr_t) tmp, , false);
> >
> >-if (invalidok || VAR_IS_VALID(var[0])) {
> >+if (VAR_IS_VALID(var[0])) {
> > *index = (int)(uintptr_t) tmp;
> > return OPAL_SUCCESS;
> > }
> >I`m not sure we understand yet why it fixes the problem and what is a
> >race.
> >Could some` with knowledge of MCA flows look at it and comment?
> >The "invalidok" was introduced by Jeff`s commit.
> >Thanks
> >M
>


Re: [OMPI devel] bug in mca framework?

2013-12-03 Thread Nathan Hjelm
This usually happens when a string that belongs to the MCA system is freed
elsewhere. Can you find out the name of the variable that is being destructed
in frame 2?
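
As a generic illustration of the ownership problem described here (a string owned by one subsystem but freed by another), the sketch below shows the usual defensive pattern: the owner keeps a private copy, so the caller may free its own buffer without affecting the owner. The names own_var_t, own_set and own_release are hypothetical and are not part of the MCA variable system.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical owner of a string value; NOT Open MPI code. */
typedef struct {
    char *value;   /* private copy, freed only by own_set()/own_release() */
} own_var_t;

/* Stores a private copy of text (or NULL), releasing any previous copy. */
int own_set(own_var_t *v, const char *text)
{
    char *copy = NULL;
    if (NULL != text) {
        size_t n = strlen(text) + 1;
        copy = malloc(n);
        if (NULL == copy) {
            return -1;
        }
        memcpy(copy, text, n);
    }
    free(v->value);
    v->value = copy;
    return 0;
}

/* Frees the private copy exactly once and clears the pointer. */
void own_release(own_var_t *v)
{
    free(v->value);
    v->value = NULL;
}
```

If the owner instead stored the caller's pointer directly, a free() on the caller's side would leave the owner holding a dangling pointer, and the owner's later destructor would free it a second time.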

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Dec 03, 2013 at 02:53:29PM +0200, Mike Dubman wrote:
>Hi,
>We observe crash during shmem_finalize()  (in trunk) with new MCA
>framework.
>After investigation, found that  MCA tears-down process can access
>previously released memory. (reproduced with oshmem_hello_c.c test)
>0 0x7fffed3d51d0 in ?? ()
>#1 
>#2 0x7710e21e in var_destructor (var=0x6fa7e0) at
>mca_base_var.c:1605
>#3 0x7710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at
>../../../opal/class/opal_object.h:448
>#4 0x7710ca18 in mca_base_var_finalize () at mca_base_var.c:954
>#5 0x7710a7e2 in mca_base_param_finalize () at
>mca_base_param.c:643
>#6 0x770e08e2 in opal_finalize_util () at
>runtime/opal_finalize.c:77
>#7 0x77aa5319 in ompi_mpi_finalize () at
>runtime/ompi_mpi_finalize.c:407
>#8 0x77d900cc in oshmem_shmem_finalize () at
>runtime/oshmem_shmem_finalize.c:75
>#9 0x77d91119 in shmem_finalize () at shmem_finalize.c:24
>#10 0x77d89b8f in __do_global_dtors_aux () from
>/install/lib/libshmem.so.0
>#11 0x in ?? ()
>The crash can be resolved by following patch:
>diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
>index 9966627..48028d8 100644
>--- a/opal/mca/base/mca_base_var.c
>+++ b/opal/mca/base/mca_base_var.c
>@@ -773,7 +773,7 @@ static int var_find_by_name (const char *full_name,
>int *index, bool invalidok)
> 
> (void) var_get ((int)(uintptr_t) tmp, , false);
> 
>-if (invalidok || VAR_IS_VALID(var[0])) {
>+if (VAR_IS_VALID(var[0])) {
> *index = (int)(uintptr_t) tmp;
> return OPAL_SUCCESS;
> }
>I`m not sure we understand yet why it fixes the problem and what is a
>race.
>Could some` with knowledge of MCA flows look at it and comment?
>The "invalidok" was introduced by Jeff`s commit.
>Thanks
>M
