On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Please refrain from doing so, the assumption #1 this patch is based on is 
> false. First, OBJ_CONSTRUCT can be run to construct a specific type of object 
> in a preallocated memory region (not only on the stack or heap). In fact, it 
> is the only way we can dynamically initialize an object in a memory allocated 
> outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not necessarily matched 
> with an OBJ_DESTRUCT, it work just fine with OBJ_RELEASE. In fact I use these 
> feature in several places.
> 
> An example will be a memory region without a predefined size, that I 
> manipulate as opal_list_item_t. This fragment gets allocated when it's size 
> is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is 
> playing its role, when nobody is using the object anymore, it will be 
> automatically released. With the change you propose such usage will be 
> prohibited. 
> 
> The feature you are looking for, the one that might have shorten Ralph's 
> debugging time, is already in the opal_object_t. One should use the 
> cls_init_file_name and cls_init_lineno fields to see where the object was 
> first initialized as these fields are set either by the OBJ_NEW or by the 
> OBJ_CONSTRUCT call.

Not exactly. Consider the case where we have a library - e.g., ORTE. Down in 
the library, perhaps several function calls down, we receive a pointer to an 
object. The library, not knowing any better, uses OBJ_RETAIN to indicate that 
this object is being used and therefore should not be released. It then returns 
and allows an async procedure to run.

Up above, one caller to the library uses OBJ_NEW to create the object. Thus, 
the reference count system is in-play and governs when the data goes away.

However, in another place, the caller uses OBJ_CONSTRUCT to initialize the 
object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, 
the reference count system is ignored - OBJ_DESTRUCT destroys the object 
regardless of the reference count. So suddenly the library is working with 
garbage, with no way to know that it happened.

So now let's be specific to see how your suggestion doesn't solve the problem 
(I actually had tried it). Consider the OOB operating asynchronously. In the 
first case, where the opal_buffer_t is created via OBJ_NEW, we can point the 
message system at the data field in the buffer and just OBJ_RETAIN it.

However, in the second case, the OBJ_RETAIN won't work - the calling code 
releases the data area, but the OOB has no idea that happened. So the pointer 
to the data field actually isn't valid any more, but there is no way to detect 
it. Likewise, looking for opal_object_t info is useless as the fields no longer 
exist.

Yes, I realize this is an incorrect program - the caller isn't allowed to 
release data until the async operation concludes. But the fact that a library 
function, especially one that is hidden down low in the code base, is async may 
not be immediately apparent. The resulting "bug" is extremely hard to chase 
down, especially as it almost inevitably is exposed as a race condition.

The goal of this RFC was to make such problems immediately apparent. Perhaps 
one solution would be to get rid of OBJ_DESTRUCT and just have everyone use 
OBJ_RELEASE, augmented with a flag to indicate whether or not to free the 
backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, but 
Jeff, Brian and I didn't like that one. The proposed solution was another way 
to approach it that would force the above scenario to be recognized only during 
debug builds, thus allowing it to be identified and corrected.

Your example use-case is certainly something I hadn't considered - so perhaps 
we'll have to think of another way to detect my situation while still allowing 
what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN (or whatever) macro 
for use in your use-case that corrects the ref count?


> 
>  George.
> 
> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but 
> reasonable.
> 
> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> 
>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an object 
>> and its ref count goes to 0 (in debug builds only).
>> 
>> WHY: To catch bugs.
>> 
>> WHERE: opal/class/opal_class.h
>> 
>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013
>> 
>> MORE DETAIL:
>> 
>> On the call on Tuesday, we talked about some ideas for catching bugs with 
>> respect to object reference counting.  After the call, Brian, Ralph, and I 
>> came up with two useful asserts to help catch bugs (in debug builds only):
>> 
>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero.
>> 2. When a object is destroyed, its refcount should be exactly 1.
>> 
>> This RFC is for #1.  The attached patch doesn't seem to cause any problems 
>> (and we didn't expect it to).  But it's a good addition to the other 
>> asserts() that are already in the object code already.
>> 
>> As for #2, Ralph has previously found bugs in the ORTE layer that would have 
>> been much easier to track down if #2 were in place.  I'll send an RFC for #2 
>> when I have managed to fix all the problems that it has found in the OMPI 
>> layer...  :-)
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> Index: opal/class/opal_object.h
>> ===================================================================
>> --- opal/class/opal_object.h (revision 28147)
>> +++ opal/class/opal_object.h (working copy)
>> @@ -169,7 +169,7 @@
>> * @param NAME   Name of the class to initialize
>> */
>> #if OPAL_ENABLE_DEBUG
>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 
>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, 
>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ }
>> #else
>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 }
>> #endif
>> @@ -184,6 +184,10 @@
>>    /** Magic ID -- want this to be the very first item in the
>>        struct's memory */
>>    uint64_t obj_magic_id;
>> +    /* flag whether this was initialized using construct
>> +     * versus obj_new
>> +     */
>> +    bool constructed;
>> #endif
>>    opal_class_t *obj_class;            /**< class descriptor */
>>    volatile int32_t obj_reference_count;   /**< reference count */
>> @@ -252,6 +256,7 @@
>>    object->obj_magic_id = OPAL_OBJ_MAGIC_ID;
>>    object->cls_init_file_name = file;
>>    object->cls_init_lineno = line;
>> +    object->constructed = false;
>>    return object;
>> }
>> #define OBJ_NEW(type)                                   \
>> @@ -313,6 +318,8 @@
>>        assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>>        assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) 
>> (object))->obj_magic_id); \
>>        if (0 == opal_obj_update((opal_object_t *) (object), -1)) {     \
>> +            /* constructed objects are not allowed to reach 0 */        \
>> +            assert(!(((opal_object_t *) (object))->constructed));       \
>>            OBJ_SET_MAGIC_ID((object), 0);                              \
>>            opal_obj_run_destructors((opal_object_t *) (object));       \
>>            OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> @@ -344,6 +351,7 @@
>>    OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type));          \
>> } while (0)
>> 
>> +#if OPAL_ENABLE_DEBUG
>> #define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>> do {                                                                \
>>    OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>> @@ -352,11 +360,24 @@
>>    }                                                               \
>>    ((opal_object_t *) (object))->obj_class = (type);               \
>>    ((opal_object_t *) (object))->obj_reference_count = 1;          \
>> +    ((opal_object_t *) (object))->constructed = true;               \
>>    opal_obj_run_constructors((opal_object_t *) (object));          \
>>    OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> } while (0)
>> +#else
>> +#define OBJ_CONSTRUCT_INTERNAL(object, type)                        \
>> +do {                                                                \
>> +    OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID);              \
>> +    if (0 == (type)->cls_initialized) {                             \
>> +        opal_class_initialize((type));                              \
>> +    }                                                               \
>> +    ((opal_object_t *) (object))->obj_class = (type);               \
>> +    ((opal_object_t *) (object))->obj_reference_count = 1;          \
>> +    opal_obj_run_constructors((opal_object_t *) (object));          \
>> +    OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \
>> +} while (0)
>> +#endif
>> 
>> -
>> /**
>> * Destruct (finalize) an object that is not dynamically allocated.
>> *
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to