I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken 
example to justify a patch that under correct/consistent usage solve a 
non-issue?

Why can't you use any decent memory debugger (fence, purify or valgrind) to 
identify the issue you describe below?

  George.

On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> Please refrain from doing so, the assumption #1 this patch is based on is 
>> false. First, OBJ_CONSTRUCT can be run to construct a specific type of 
>> object in a preallocated memory region (not only on the stack or heap). In 
>> fact, it is the only way we can dynamically initialize an object in a memory 
>> allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not 
>> necessarily matched with an OBJ_DESTRUCT, it work just fine with 
>> OBJ_RELEASE. In fact I use these feature in several places.
>> 
>> An example will be a memory region without a predefined size, that I 
>> manipulate as opal_list_item_t. This fragment gets allocated when it's size 
>> is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is 
>> playing its role, when nobody is using the object anymore, it will be 
>> automatically released. With the change you propose such usage will be 
>> prohibited. 
>> 
>> The feature you are looking for, the one that might have shorten Ralph's 
>> debugging time, is already in the opal_object_t. One should use the 
>> cls_init_file_name and cls_init_lineno fields to see where the object was 
>> first initialized as these fields are set either by the OBJ_NEW or by the 
>> OBJ_CONSTRUCT call.
> 
> Not exactly. Consider the case where we have a library - e.g., ORTE. Down in 
> the library, perhaps several function calls down, we receive a pointer to an 
> object. The library, not knowing any better, uses OBJ_RETAIN to indicate that 
> this object is being used and therefore should not be released. It then 
> returns and allows an async procedure to run.
> 
> Up above, one caller to the library uses OBJ_NEW to create the object. Thus, 
> the reference count system is in-play and governs when the data goes away.
> 
> However, in another place, the caller uses OBJ_CONSTRUCT to initialize the 
> object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, 
> the reference count system is ignored - OBJ_DESTRUCT destroys the object 
> regardless of the reference count. So suddenly the library is working with 
> garbage, with no way to know that it happened.
> 
> So now let's be specific to see how your suggestion doesn't solve the problem 
> (I actually had tried it). Consider the OOB operating asynchronously. In the 
> first case, where the opal_buffer_t is created via OBJ_NEW, we can point the 
> message system at the data field in the buffer and just OBJ_RETAIN it.
> 
> However, in the second case, the OBJ_RETAIN won't work - the calling code 
> releases the data area, but the OOB has no idea that happened. So the pointer 
> to the data field actually isn't valid any more, but there is no way to 
> detect it. Likewise, looking for opal_object_t info is useless as the fields 
> no longer exist.
> 
> Yes, I realize this is an incorrect program - the caller isn't allowed to 
> release data until the async operation concludes. But the fact that a library 
> function, especially one that is hidden down low in the code base, is async 
> may not be immediately apparent. The resulting "bug" is extremely hard to 
> chase down, especially as it almost inevitably is exposed as a race condition.
> 
> The goal of this RFC was to make such problems immediately apparent. Perhaps 
> one solution would be to get rid of OBJ_DESTRUCT and just have everyone use 
> OBJ_RELEASE, augmented with a flag to indicate whether or not to free the 
> backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, but 
> Jeff, Brian and I didn't like that one. The proposed solution was another way 
> to approach it that would force the above scenario to be recognized only 
> during debug builds, thus allowing it to be identified and corrected.
> 
> Your example use-case is certainly something I hadn't considered - so perhaps 
> we'll have to think of another way to detect my situation while still 
> allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN (or 
> whatever) macro for use in your use-case that corrects the ref count?
> 
> 
>> 
>> George.
>> 
>> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but 
>> reasonable.
>> 
>> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>> wrote:
>> 
>>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an 
>>> object and its ref count goes to 0 (in debug builds only).
>>> 
>>> WHY: To catch bugs.
>>> 
>>> WHERE: opal/class/opal_class.h
>>> 
>>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013
>>> 
>>> MORE DETAIL:
>>> 
>>> On the call on Tuesday, we talked about some ideas for catching bugs with 
>>> respect to object reference counting.  After the call, Brian, Ralph, and I 
>>> came up with two useful asserts to help catch bugs (in debug builds only):
>>> 
>>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero.
>>> 2. When a object is destroyed, its refcount should be exactly 1.
>>> 
>>> This RFC is for #1.  The attached patch doesn't seem to cause any problems 
>>> (and we didn't expect it to).  But it's a good addition to the other 
>>> asserts() that are already in the object code already.
>>> 
>>> As for #2, Ralph has previously found bugs in the ORTE layer that would 
>>> have been much easier to track down if #2 were in place.  I'll send an RFC 
>>> for #2 when I have managed to fix all the problems that it has found in the 
>>> OMPI layer...  :-)
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> Index: opal/class/opal_object.h
>>> ===================================================================
>>> --- opal/class/opal_object.h        (revision 28147)
>>> +++ opal/class/opal_object.h        (working copy)
>>> @@ -169,7 +169,7 @@
>>> * @param NAME   Name of the class to initialize
>>> */
>>> #if OPAL_ENABLE_DEBUG
>>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 
>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, 
>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>>> #else
>>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 }
>>> #endif
>>> @@ -184,6 +184,10 @@
>>>   /** Magic ID -- want this to be the very first item in the
>>>       struct's memory */
>>>   uint64_t obj_magic_id;
>>> +    /* flag whether this was initialized using construct
>>> +     * versus obj_new
>>> +     */
>>> +    bool constructed;
>>> #endif
>>>   opal_class_t *obj_class;            /**< class descriptor */
>>>   volatile int32_t obj_reference_count;   /**< reference count */
>>> @@ -252,6 +256,7 @@
>>>   object->obj_magic_id = OPAL_OBJ_MAGIC_ID;
>>>   object->cls_init_file_name = file;
>>>   object->cls_init_lineno = line;
>>> +    object->constructed = false;
>>>   return object;
>>> }
>>> #define OBJ_NEW(type)                                   \
>>> @@ -313,6 +318,8 @@
>>>       assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>>>       assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) 
>>> (object))->obj_magic_id); \
>>>       if (0 == opal_obj_update((opal_object_t *) (object), -1)) {     \
>>> +            /* constructed objects are not allowed to reach 0 */        \
>>> +            assert(!(((opal_object_t *) (object))->constructed));       \
>>>           OBJ_SET_MAGIC_ID((object), 0);                              \
>>>           opal_obj_run_destructors((opal_object_t *) (object));       \
>>>           OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>> @@ -344,6 +351,7 @@
>>>   OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type));          \
>>> } while (0)
>>> 
>>> +#if OPAL_ENABLE_DEBUG
>>> #define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>> do {                                                                \
>>>   OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>> @@ -352,11 +360,24 @@
>>>   }                                                               \
>>>   ((opal_object_t *) (object))->obj_class = (type);               \
>>>   ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>> +    ((opal_object_t *) (object))->constructed = true;               \
>>>   opal_obj_run_constructors((opal_object_t *) (object));          \
>>>   OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>> } while (0)
>>> +#else
>>> +#define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>>> +do {                                                                \
>>> +    OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>>> +    if (0 == (type)->cls_initialized) {                             \
>>> +        opal_class_initialize((type));                              \
>>> +    }                                                               \
>>> +    ((opal_object_t *) (object))->obj_class = (type);               \
>>> +    ((opal_object_t *) (object))->obj_reference_count = 1;          \
>>> +    opal_obj_run_constructors((opal_object_t *) (object));          \
>>> +    OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>>> +} while (0)
>>> +#endif
>>> 
>>> -
>>> /**
>>> * Destruct (finalize) an object that is not dynamically allocated.
>>> *
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to