Isn't that what assert()'s are for? :-) The point is that for Ralph's case, the #2 assert would cause a notification *right at the point of the error* (i.e., the errant DESTRUCT). The other tools you mention all cause notifications long after the DESTRUCT (i.e., when the freed memory is used, at which time it may be totally jumbled/corrupt/the FILE/line info gone). You know how hard it can be to track down memory corruption; having a notification right at the point of the error is a Good Thing.
Don't forget that we have lots of other asserts in the OBJ system (in the debug case only, of course). And per my other mail, if I find legitimate cases where the destructors are invoked where refcount!=1, we'll figure something else out. On Mar 8, 2013, at 7:22 AM, George Bosilca <bosi...@icl.utk.edu> wrote: > I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken > example to justify a patch that under correct/consistent usage solve a > non-issue? > > Why can't you use any decent memory debugger (fence, purify or valgrind) to > identify the issue you describe below? > > George. > > On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote: > >> >> On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote: >> >>> Please refrain from doing so, the assumption #1 this patch is based on is >>> false. First, OBJ_CONSTRUCT can be run to construct a specific type of >>> object in a preallocated memory region (not only on the stack or heap). In >>> fact, it is the only way we can dynamically initialize an object in a >>> memory allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not >>> necessarily matched with an OBJ_DESTRUCT, it work just fine with >>> OBJ_RELEASE. In fact I use these feature in several places. >>> >>> An example will be a memory region without a predefined size, that I >>> manipulate as opal_list_item_t. This fragment gets allocated when it's size >>> is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is >>> playing its role, when nobody is using the object anymore, it will be >>> automatically released. With the change you propose such usage will be >>> prohibited. >>> >>> The feature you are looking for, the one that might have shorten Ralph's >>> debugging time, is already in the opal_object_t. One should use the >>> cls_init_file_name and cls_init_lineno fields to see where the object was >>> first initialized as these fields are set either by the OBJ_NEW or by the >>> OBJ_CONSTRUCT call. >> >> Not exactly. Consider the case where we have a library - e.g., ORTE. Down in >> the library, perhaps several function calls down, we receive a pointer to an >> object. The library, not knowing any better, uses OBJ_RETAIN to indicate >> that this object is being used and therefore should not be released. It then >> returns and allows an async procedure to run. >> >> Up above, one caller to the library uses OBJ_NEW to create the object. Thus, >> the reference count system is in-play and governs when the data goes away. >> >> However, in another place, the caller uses OBJ_CONSTRUCT to initialize the >> object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, >> the reference count system is ignored - OBJ_DESTRUCT destroys the object >> regardless of the reference count. So suddenly the library is working with >> garbage, with no way to know that it happened. >> >> So now let's be specific to see how your suggestion doesn't solve the >> problem (I actually had tried it). Consider the OOB operating >> asynchronously. In the first case, where the opal_buffer_t is created via >> OBJ_NEW, we can point the message system at the data field in the buffer and >> just OBJ_RETAIN it. >> >> However, in the second case, the OBJ_RETAIN won't work - the calling code >> releases the data area, but the OOB has no idea that happened. So the >> pointer to the data field actually isn't valid any more, but there is no way >> to detect it. Likewise, looking for opal_object_t info is useless as the >> fields no longer exist. >> >> Yes, I realize this is an incorrect program - the caller isn't allowed to >> release data until the async operation concludes. But the fact that a >> library function, especially one that is hidden down low in the code base, >> is async may not be immediately apparent. The resulting "bug" is extremely >> hard to chase down, especially as it almost inevitably is exposed as a race >> condition. >> >> The goal of this RFC was to make such problems immediately apparent. Perhaps >> one solution would be to get rid of OBJ_DESTRUCT and just have everyone use >> OBJ_RELEASE, augmented with a flag to indicate whether or not to free the >> backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, >> but Jeff, Brian and I didn't like that one. The proposed solution was >> another way to approach it that would force the above scenario to be >> recognized only during debug builds, thus allowing it to be identified and >> corrected. >> >> Your example use-case is certainly something I hadn't considered - so >> perhaps we'll have to think of another way to detect my situation while >> still allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN >> (or whatever) macro for use in your use-case that corrects the ref count? >> >> >>> >>> George. >>> >>> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but >>> reasonable. >>> >>> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> >>> wrote: >>> >>>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an >>>> object and its ref count goes to 0 (in debug builds only). >>>> >>>> WHY: To catch bugs. >>>> >>>> WHERE: opal/class/opal_class.h >>>> >>>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013 >>>> >>>> MORE DETAIL: >>>> >>>> On the call on Tuesday, we talked about some ideas for catching bugs with >>>> respect to object reference counting. After the call, Brian, Ralph, and I >>>> came up with two useful asserts to help catch bugs (in debug builds only): >>>> >>>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero. >>>> 2. When a object is destroyed, its refcount should be exactly 1. >>>> >>>> This RFC is for #1. The attached patch doesn't seem to cause any problems >>>> (and we didn't expect it to). But it's a good addition to the other >>>> asserts() that are already in the object code already. >>>> >>>> As for #2, Ralph has previously found bugs in the ORTE layer that would >>>> have been much easier to track down if #2 were in place. I'll send an RFC >>>> for #2 when I have managed to fix all the problems that it has found in >>>> the OMPI layer... :-) >>>> >>>> -- >>>> Jeff Squyres >>>> jsquy...@cisco.com >>>> For corporate legal information go to: >>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>> >>>> Index: opal/class/opal_object.h >>>> =================================================================== >>>> --- opal/class/opal_object.h (revision 28147) >>>> +++ opal/class/opal_object.h (working copy) >>>> @@ -169,7 +169,7 @@ >>>> * @param NAME Name of the class to initialize >>>> */ >>>> #if OPAL_ENABLE_DEBUG >>>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, >>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, >>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>>> #else >>>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 } >>>> #endif >>>> @@ -184,6 +184,10 @@ >>>> /** Magic ID -- want this to be the very first item in the >>>> struct's memory */ >>>> uint64_t obj_magic_id; >>>> + /* flag whether this was initialized using construct >>>> + * versus obj_new >>>> + */ >>>> + bool constructed; >>>> #endif >>>> opal_class_t *obj_class; /**< class descriptor */ >>>> volatile int32_t obj_reference_count; /**< reference count */ >>>> @@ -252,6 +256,7 @@ >>>> object->obj_magic_id = OPAL_OBJ_MAGIC_ID; >>>> object->cls_init_file_name = file; >>>> object->cls_init_lineno = line; >>>> + object->constructed = false; >>>> return object; >>>> } >>>> #define OBJ_NEW(type) \ >>>> @@ -313,6 +318,8 @@ >>>> assert(NULL != ((opal_object_t *) (object))->obj_class); \ >>>> assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) >>>> (object))->obj_magic_id); \ >>>> if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \ >>>> + /* constructed objects are not allowed to reach 0 */ \ >>>> + assert(!(((opal_object_t *) (object))->constructed)); \ >>>> OBJ_SET_MAGIC_ID((object), 0); \ >>>> opal_obj_run_destructors((opal_object_t *) (object)); \ >>>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>> @@ -344,6 +351,7 @@ >>>> OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type)); \ >>>> } while (0) >>>> >>>> +#if OPAL_ENABLE_DEBUG >>>> #define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>>> do { \ >>>> OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>>> @@ -352,11 +360,24 @@ >>>> } \ >>>> ((opal_object_t *) (object))->obj_class = (type); \ >>>> ((opal_object_t *) (object))->obj_reference_count = 1; \ >>>> + ((opal_object_t *) (object))->constructed = true; \ >>>> opal_obj_run_constructors((opal_object_t *) (object)); \ >>>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>> } while (0) >>>> +#else >>>> +#define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>>> +do { \ >>>> + OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>>> + if (0 == (type)->cls_initialized) { \ >>>> + opal_class_initialize((type)); \ >>>> + } \ >>>> + ((opal_object_t *) (object))->obj_class = (type); \ >>>> + ((opal_object_t *) (object))->obj_reference_count = 1; \ >>>> + opal_obj_run_constructors((opal_object_t *) (object)); \ >>>> + OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>> +} while (0) >>>> +#endif >>>> >>>> - >>>> /** >>>> * Destruct (finalize) an object that is not dynamically allocated. >>>> * >>>> >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/