On Mar 8, 2013, at 15:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Isn't that what assert()'s are for?  :-)
> 
> The point is that for Ralph's case, the #2 assert would cause a notification 
> *right at the point of the error* (i.e., the errant DESTRUCT).  The other 
> tools you mention all cause notifications long after the DESTRUCT (i.e., when 
> the freed memory is used, at which time it may be totally jumbled/corrupt/the 
> FILE/line info gone).  You know how hard it can be to track down memory 
> corruption; having a notification right at the point of the error is a Good 
> Thing.

This is indeed true. However, a tool like valgrind keep trace of where the 
memory was allocated and freed, so this might provide enough info to identify 
and fix the issue.

> Don't forget that we have lots of other asserts in the OBJ system (in the 
> debug case only, of course).
> 
> And per my other mail, if I find legitimate cases where the destructors are 
> invoked where refcount!=1, we'll figure something else out.

I am not arguing about the validity of the assert, but about the fact that now 
we can't just OBJ_DESTRUCT something, we are required to first set it's 
refcount to 1.

I have a more advanced use case for you. Based on the MPI standard, 
MPI_Finalize can be called while the user still has non-complete requests 
returned by any of the non-blocking calls (there are some drawbacks of course, 
but this is not specifically prohibited). Thus, these requests will not have a 
ref count to 1, so they will not be able to be destructed. This is exactly what 
our code is doing today:

pml_base_close.c:58:    OBJ_DESTRUCT(&mca_pml_base_send_requests);
pml_base_close.c:59:    OBJ_DESTRUCT(&mca_pml_base_recv_requests);

and then ompi_freelist.c:86.

  George.


> 
> 
> 
> On Mar 8, 2013, at 7:22 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken 
>> example to justify a patch that under correct/consistent usage solve a 
>> non-issue?
>> 
>> Why can't you use any decent memory debugger (fence, purify or valgrind) to 
>> identify the issue you describe below?
>> 
>> George.
>> 
>> On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> 
>>> On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>>> Please refrain from doing so, the assumption #1 this patch is based on is 
>>>> false. First, OBJ_CONSTRUCT can be run to construct a specific type of 
>>>> object in a preallocated memory region (not only on the stack or heap). In 
>>>> fact, it is the only way we can dynamically initialize an object in a 
>>>> memory allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not 
>>>> necessarily matched with an OBJ_DESTRUCT, it work just fine with 
>>>> OBJ_RELEASE. In fact I use these feature in several places.
>>>> 
>>>> An example will be a memory region without a predefined size, that I 
>>>> manipulate as opal_list_item_t. This fragment gets allocated when it's 
>>>> size is know, then gets OBJ_CONSTRUCT'ed and then used. The reference 
>>>> count is playing its role, when nobody is using the object anymore, it 
>>>> will be automatically released. With the change you propose such usage 
>>>> will be prohibited. 
>>>> 
>>>> The feature you are looking for, the one that might have shorten Ralph's 
>>>> debugging time, is already in the opal_object_t. One should use the 
>>>> cls_init_file_name and cls_init_lineno fields to see where the object was 
>>>> first initialized as these fields are set either by the OBJ_NEW or by the 
>>>> OBJ_CONSTRUCT call.
>>> 
>>> Not exactly. Consider the case where we have a library - e.g., ORTE. Down 
>>> in the library, perhaps several function calls down, we receive a pointer 
>>> to an object. The library, not knowing any better, uses OBJ_RETAIN to 
>>> indicate that this object is being used and therefore should not be 
>>> released. It then returns and allows an async procedure to run.
>>> 
>>> Up above, one caller to the library uses OBJ_NEW to create the object. 
>>> Thus, the reference count system is in-play and governs when the data goes 
>>> away.
>>> 
>>> However, in another place, the caller uses OBJ_CONSTRUCT to initialize the 
>>> object, and OBJ_DESTRUCTs it when the high-level call returns. In this 
>>> case, the reference count system is ignored - OBJ_DESTRUCT destroys the 
>>> object regardless of the reference count. So suddenly the library is 
>>> working with garbage, with no way to know that it happened.
>>> 
>>> So now let's be specific to see how your suggestion doesn't solve the 
>>> problem (I actually had tried it). Consider the OOB operating 
>>> asynchronously. In the first case, where the opal_buffer_t is created via 
>>> OBJ_NEW, we can point the message system at the data field in the buffer 
>>> and just OBJ_RETAIN it.
>>> 
>>> However, in the second case, the OBJ_RETAIN won't work - the calling code 
>>> releases the data area, but the OOB has no idea that happened. So the 
>>> pointer to the data field actually isn't valid any more, but there is no 
>>> way to detect it. Likewise, looking for opal_object_t info is useless as 
>>> the fields no longer exist.
>>> 
>>> Yes, I realize this is an incorrect program - the caller isn't allowed to 
>>> release data until the async operation concludes. But the fact that a 
>>> library function, especially one that is hidden down low in the code base, 
>>> is async may not be immediately apparent. The resulting "bug" is extremely 
>>> hard to chase down, especially as it almost inevitably is exposed as a race 
>>> condition.
>>> 
>>> The goal of this RFC was to make such problems immediately apparent. 
>>> Perhaps one solution would be to get rid of OBJ_DESTRUCT and just have 
>>> everyone use OBJ_RELEASE, augmented with a flag to indicate whether or not 
>>> to free the backing memory. Another might be to have OBJ_DESTRUCT respect 
>>> ref counts, but Jeff, Brian and I didn't like that one. The proposed 
>>> solution was another way to approach it that would force the above scenario 
>>> to be recognized only during debug builds, thus allowing it to be 
>>> identified and corrected.
>>> 
>>> Your example use-case is certainly something I hadn't considered - so 
>>> perhaps we'll have to think of another way to detect my situation while 
>>> still allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN 
>>> (or whatever) macro for use in your use-case that corrects the ref count?
>>> 
>>> 
>>>> 
>>>> George.
>>>> 
>>>> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but 
>>>> reasonable.
>>>> 
>>>> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>>> wrote:
>>>> 
>>>>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an 
>>>>> object and its ref count goes to 0 (in debug builds only).
>>>>> 
>>>>> WHY: To catch bugs.
>>>>> 
>>>>> WHERE: opal/class/opal_class.h
>>>>> 
>>>>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013
>>>>> 
>>>>> MORE DETAIL:
>>>>> 
>>>>> On the call on Tuesday, we talked about some ideas for catching bugs with 
>>>>> respect to object reference counting.  After the call, Brian, Ralph, and 
>>>>> I came up with two useful asserts to help catch bugs (in debug builds 
>>>>> only):
>>>>> 
>>>>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero.
>>>>> 2. When a object is destroyed, its refcount should be exactly 1.
>>>>> 
>>>>> This RFC is for #1.  The attached patch doesn't seem to cause any 
>>>>> problems (and we didn't expect it to).  But it's a good addition to the 
>>>>> other asserts() that are already in the object code already.
>>>>> 
>>>>> As for #2, Ralph has previously found bugs in the ORTE layer that would 
>>>>> have been much easier to track down if #2 were in place.  I'll send an 
>>>>> RFC for #2 when I have managed to fix all the problems that it has found 
>>>>> in the OMPI layer...  :-)
>>>>> 
>>>>> -- 
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to: 
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>>> Index: opal/class/opal_object.h
>>>>> ===================================================================
>>>>> --- opal/class/opal_object.h      (revision 28147)
>>>>> +++ opal/class/opal_object.h      (working copy)
>>>>> @@ -169,7 +169,7 @@
>>>>> * @param NAME   Name of the class to initialize
>>>>> */
>>>>> #if OPAL_ENABLE_DEBUG
>>>>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 
>>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>>>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, 
>>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>>>> #else
>>>>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 }
>>>>> #endif
>>>>> @@ -184,6 +184,10 @@
>>>>> /** Magic ID -- want this to be the very first item in the
>>>>>     struct's memory */
>>>>> uint64_t obj_magic_id;
>>>>> +    /* flag whether this was initialized using construct
>>>>> +     * versus obj_new
>>>>> +     */
>>>>> +    bool constructed;
>>>>> #endif
>>>>> opal_class_t *obj_class;            /**< class descriptor */
>>>>> volatile int32_t obj_reference_count;   /**< reference count */
>>>>> @@ -252,6 +256,7 @@
>>>>> object->obj_magic_id = OPAL_OBJ_MAGIC_ID;
>>>>> object->cls_init_file_name = file;
>>>>> object->cls_init_lineno = line;
>>>>> +    object->constructed = false;
>>>>> return object;
>>>>> }
>>>>> #define OBJ_NEW(type)                                   \
>>>>> @@ -313,6 +318,8 @@
>>>>>     assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>>>>>     assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) 
>>>>> (object))->obj_magic_id); \
>>>>>     if (0 == opal_obj_update((opal_object_t *) (object), -1)) {     \
>>>>> +            /* constructed objects are not allowed to reach 0 */        \
>>>>> +            assert(!(((opal_object_t *) (object))->constructed));       \
>>>>>         OBJ_SET_MAGIC_ID((object), 0);                              \
>>>>>         opal_obj_run_destructors((opal_object_t *) (object));       \
>>>>>         OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>>> @@ -344,6 +351,7 @@
>>>>> OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type));          \
>>>>> } while (0)
>>>>> 
>>>>> +#if OPAL_ENABLE_DEBUG
>>>>> #define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>>>> do {                                                                \
>>>>> OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>>>> @@ -352,11 +360,24 @@
>>>>> }                                                               \
>>>>> ((opal_object_t *) (object))->obj_class = (type);               \
>>>>> ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>>>> +    ((opal_object_t *) (object))->constructed = true;               \
>>>>> opal_obj_run_constructors((opal_object_t *) (object));          \
>>>>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>>> } while (0)
>>>>> +#else
>>>>> +#define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>>>> +do {                                                                \
>>>>> +    OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>>>> +    if (0 == (type)->cls_initialized) {                             \
>>>>> +        opal_class_initialize((type));                              \
>>>>> +    }                                                               \
>>>>> +    ((opal_object_t *) (object))->obj_class = (type);               \
>>>>> +    ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>>>> +    opal_obj_run_constructors((opal_object_t *) (object));          \
>>>>> +    OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>>> +} while (0)
>>>>> +#endif
>>>>> 
>>>>> -
>>>>> /**
>>>>> * Destruct (finalize) an object that is not dynamically allocated.
>>>>> *
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to