Isn't that what assert()'s are for?  :-)

The point is that for Ralph's case, the #2 assert would cause a notification 
*right at the point of the error* (i.e., the errant DESTRUCT).  The other tools 
you mention all cause notifications long after the DESTRUCT (i.e., when the 
freed memory is used, at which time it may be totally jumbled/corrupt/the 
FILE/line info gone).  You know how hard it can be to track down memory 
corruption; having a notification right at the point of the error is a Good 
Thing.

Don't forget that we have lots of other asserts in the OBJ system (in the debug 
case only, of course).

And per my other mail, if I find legitimate cases where the destructors are 
invoked where refcount!=1, we'll figure something else out.



On Mar 8, 2013, at 7:22 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken 
> example to justify a patch that under correct/consistent usage solve a 
> non-issue?
> 
> Why can't you use any decent memory debugger (fence, purify or valgrind) to 
> identify the issue you describe below?
> 
>  George.
> 
> On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote:
> 
>> 
>> On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> Please refrain from doing so, the assumption #1 this patch is based on is 
>>> false. First, OBJ_CONSTRUCT can be run to construct a specific type of 
>>> object in a preallocated memory region (not only on the stack or heap). In 
>>> fact, it is the only way we can dynamically initialize an object in a 
>>> memory allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not 
>>> necessarily matched with an OBJ_DESTRUCT, it work just fine with 
>>> OBJ_RELEASE. In fact I use these feature in several places.
>>> 
>>> An example will be a memory region without a predefined size, that I 
>>> manipulate as opal_list_item_t. This fragment gets allocated when it's size 
>>> is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is 
>>> playing its role, when nobody is using the object anymore, it will be 
>>> automatically released. With the change you propose such usage will be 
>>> prohibited. 
>>> 
>>> The feature you are looking for, the one that might have shorten Ralph's 
>>> debugging time, is already in the opal_object_t. One should use the 
>>> cls_init_file_name and cls_init_lineno fields to see where the object was 
>>> first initialized as these fields are set either by the OBJ_NEW or by the 
>>> OBJ_CONSTRUCT call.
>> 
>> Not exactly. Consider the case where we have a library - e.g., ORTE. Down in 
>> the library, perhaps several function calls down, we receive a pointer to an 
>> object. The library, not knowing any better, uses OBJ_RETAIN to indicate 
>> that this object is being used and therefore should not be released. It then 
>> returns and allows an async procedure to run.
>> 
>> Up above, one caller to the library uses OBJ_NEW to create the object. Thus, 
>> the reference count system is in-play and governs when the data goes away.
>> 
>> However, in another place, the caller uses OBJ_CONSTRUCT to initialize the 
>> object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, 
>> the reference count system is ignored - OBJ_DESTRUCT destroys the object 
>> regardless of the reference count. So suddenly the library is working with 
>> garbage, with no way to know that it happened.
>> 
>> So now let's be specific to see how your suggestion doesn't solve the 
>> problem (I actually had tried it). Consider the OOB operating 
>> asynchronously. In the first case, where the opal_buffer_t is created via 
>> OBJ_NEW, we can point the message system at the data field in the buffer and 
>> just OBJ_RETAIN it.
>> 
>> However, in the second case, the OBJ_RETAIN won't work - the calling code 
>> releases the data area, but the OOB has no idea that happened. So the 
>> pointer to the data field actually isn't valid any more, but there is no way 
>> to detect it. Likewise, looking for opal_object_t info is useless as the 
>> fields no longer exist.
>> 
>> Yes, I realize this is an incorrect program - the caller isn't allowed to 
>> release data until the async operation concludes. But the fact that a 
>> library function, especially one that is hidden down low in the code base, 
>> is async may not be immediately apparent. The resulting "bug" is extremely 
>> hard to chase down, especially as it almost inevitably is exposed as a race 
>> condition.
>> 
>> The goal of this RFC was to make such problems immediately apparent. Perhaps 
>> one solution would be to get rid of OBJ_DESTRUCT and just have everyone use 
>> OBJ_RELEASE, augmented with a flag to indicate whether or not to free the 
>> backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, 
>> but Jeff, Brian and I didn't like that one. The proposed solution was 
>> another way to approach it that would force the above scenario to be 
>> recognized only during debug builds, thus allowing it to be identified and 
>> corrected.
>> 
>> Your example use-case is certainly something I hadn't considered - so 
>> perhaps we'll have to think of another way to detect my situation while 
>> still allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN 
>> (or whatever) macro for use in your use-case that corrects the ref count?
>> 
>> 
>>> 
>>> George.
>>> 
>>> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but 
>>> reasonable.
>>> 
>>> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>> wrote:
>>> 
>>>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an 
>>>> object and its ref count goes to 0 (in debug builds only).
>>>> 
>>>> WHY: To catch bugs.
>>>> 
>>>> WHERE: opal/class/opal_class.h
>>>> 
>>>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013
>>>> 
>>>> MORE DETAIL:
>>>> 
>>>> On the call on Tuesday, we talked about some ideas for catching bugs with 
>>>> respect to object reference counting.  After the call, Brian, Ralph, and I 
>>>> came up with two useful asserts to help catch bugs (in debug builds only):
>>>> 
>>>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero.
>>>> 2. When a object is destroyed, its refcount should be exactly 1.
>>>> 
>>>> This RFC is for #1.  The attached patch doesn't seem to cause any problems 
>>>> (and we didn't expect it to).  But it's a good addition to the other 
>>>> asserts() that are already in the object code already.
>>>> 
>>>> As for #2, Ralph has previously found bugs in the ORTE layer that would 
>>>> have been much easier to track down if #2 were in place.  I'll send an RFC 
>>>> for #2 when I have managed to fix all the problems that it has found in 
>>>> the OMPI layer...  :-)
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> Index: opal/class/opal_object.h
>>>> ===================================================================
>>>> --- opal/class/opal_object.h       (revision 28147)
>>>> +++ opal/class/opal_object.h       (working copy)
>>>> @@ -169,7 +169,7 @@
>>>> * @param NAME   Name of the class to initialize
>>>> */
>>>> #if OPAL_ENABLE_DEBUG
>>>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 
>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, 
>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>>> #else
>>>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 }
>>>> #endif
>>>> @@ -184,6 +184,10 @@
>>>>  /** Magic ID -- want this to be the very first item in the
>>>>      struct's memory */
>>>>  uint64_t obj_magic_id;
>>>> +    /* flag whether this was initialized using construct
>>>> +     * versus obj_new
>>>> +     */
>>>> +    bool constructed;
>>>> #endif
>>>>  opal_class_t *obj_class;            /**< class descriptor */
>>>>  volatile int32_t obj_reference_count;   /**< reference count */
>>>> @@ -252,6 +256,7 @@
>>>>  object->obj_magic_id = OPAL_OBJ_MAGIC_ID;
>>>>  object->cls_init_file_name = file;
>>>>  object->cls_init_lineno = line;
>>>> +    object->constructed = false;
>>>>  return object;
>>>> }
>>>> #define OBJ_NEW(type)                                   \
>>>> @@ -313,6 +318,8 @@
>>>>      assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>>>>      assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) 
>>>> (object))->obj_magic_id); \
>>>>      if (0 == opal_obj_update((opal_object_t *) (object), -1)) {     \
>>>> +            /* constructed objects are not allowed to reach 0 */        \
>>>> +            assert(!(((opal_object_t *) (object))->constructed));       \
>>>>          OBJ_SET_MAGIC_ID((object), 0);                              \
>>>>          opal_obj_run_destructors((opal_object_t *) (object));       \
>>>>          OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>> @@ -344,6 +351,7 @@
>>>>  OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type));          \
>>>> } while (0)
>>>> 
>>>> +#if OPAL_ENABLE_DEBUG
>>>> #define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>>> do {                                                                \
>>>>  OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>>> @@ -352,11 +360,24 @@
>>>>  }                                                               \
>>>>  ((opal_object_t *) (object))->obj_class = (type);               \
>>>>  ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>>> +    ((opal_object_t *) (object))->constructed = true;               \
>>>>  opal_obj_run_constructors((opal_object_t *) (object));          \
>>>>  OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>> } while (0)
>>>> +#else
>>>> +#define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>>> +do {                                                                \
>>>> +    OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>>> +    if (0 == (type)->cls_initialized) {                             \
>>>> +        opal_class_initialize((type));                              \
>>>> +    }                                                               \
>>>> +    ((opal_object_t *) (object))->obj_class = (type);               \
>>>> +    ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>>> +    opal_obj_run_constructors((opal_object_t *) (object));          \
>>>> +    OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>>> +} while (0)
>>>> +#endif
>>>> 
>>>> -
>>>> /**
>>>> * Destruct (finalize) an object that is not dynamically allocated.
>>>> *
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


Reply via email to