I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken example to justify a patch that under correct/consistent usage solve a non-issue?
Why can't you use any decent memory debugger (fence, purify or valgrind) to identify the issue you describe below? George. On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote: > > On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote: > >> Please refrain from doing so, the assumption #1 this patch is based on is >> false. First, OBJ_CONSTRUCT can be run to construct a specific type of >> object in a preallocated memory region (not only on the stack or heap). In >> fact, it is the only way we can dynamically initialize an object in a memory >> allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not >> necessarily matched with an OBJ_DESTRUCT, it work just fine with >> OBJ_RELEASE. In fact I use these feature in several places. >> >> An example will be a memory region without a predefined size, that I >> manipulate as opal_list_item_t. This fragment gets allocated when it's size >> is know, then gets OBJ_CONSTRUCT'ed and then used. The reference count is >> playing its role, when nobody is using the object anymore, it will be >> automatically released. With the change you propose such usage will be >> prohibited. >> >> The feature you are looking for, the one that might have shorten Ralph's >> debugging time, is already in the opal_object_t. One should use the >> cls_init_file_name and cls_init_lineno fields to see where the object was >> first initialized as these fields are set either by the OBJ_NEW or by the >> OBJ_CONSTRUCT call. > > Not exactly. Consider the case where we have a library - e.g., ORTE. Down in > the library, perhaps several function calls down, we receive a pointer to an > object. The library, not knowing any better, uses OBJ_RETAIN to indicate that > this object is being used and therefore should not be released. It then > returns and allows an async procedure to run. > > Up above, one caller to the library uses OBJ_NEW to create the object. Thus, > the reference count system is in-play and governs when the data goes away. > > However, in another place, the caller uses OBJ_CONSTRUCT to initialize the > object, and OBJ_DESTRUCTs it when the high-level call returns. In this case, > the reference count system is ignored - OBJ_DESTRUCT destroys the object > regardless of the reference count. So suddenly the library is working with > garbage, with no way to know that it happened. > > So now let's be specific to see how your suggestion doesn't solve the problem > (I actually had tried it). Consider the OOB operating asynchronously. In the > first case, where the opal_buffer_t is created via OBJ_NEW, we can point the > message system at the data field in the buffer and just OBJ_RETAIN it. > > However, in the second case, the OBJ_RETAIN won't work - the calling code > releases the data area, but the OOB has no idea that happened. So the pointer > to the data field actually isn't valid any more, but there is no way to > detect it. Likewise, looking for opal_object_t info is useless as the fields > no longer exist. > > Yes, I realize this is an incorrect program - the caller isn't allowed to > release data until the async operation concludes. But the fact that a library > function, especially one that is hidden down low in the code base, is async > may not be immediately apparent. The resulting "bug" is extremely hard to > chase down, especially as it almost inevitably is exposed as a race condition. > > The goal of this RFC was to make such problems immediately apparent. Perhaps > one solution would be to get rid of OBJ_DESTRUCT and just have everyone use > OBJ_RELEASE, augmented with a flag to indicate whether or not to free the > backing memory. Another might be to have OBJ_DESTRUCT respect ref counts, but > Jeff, Brian and I didn't like that one. The proposed solution was another way > to approach it that would force the above scenario to be recognized only > during debug builds, thus allowing it to be identified and corrected. > > Your example use-case is certainly something I hadn't considered - so perhaps > we'll have to think of another way to detect my situation while still > allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN (or > whatever) macro for use in your use-case that corrects the ref count? > > >> >> George. >> >> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but >> reasonable. >> >> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> >> wrote: >> >>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an >>> object and its ref count goes to 0 (in debug builds only). >>> >>> WHY: To catch bugs. >>> >>> WHERE: opal/class/opal_class.h >>> >>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013 >>> >>> MORE DETAIL: >>> >>> On the call on Tuesday, we talked about some ideas for catching bugs with >>> respect to object reference counting. After the call, Brian, Ralph, and I >>> came up with two useful asserts to help catch bugs (in debug builds only): >>> >>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero. >>> 2. When a object is destroyed, its refcount should be exactly 1. >>> >>> This RFC is for #1. The attached patch doesn't seem to cause any problems >>> (and we didn't expect it to). But it's a good addition to the other >>> asserts() that are already in the object code already. >>> >>> As for #2, Ralph has previously found bugs in the ORTE layer that would >>> have been much easier to track down if #2 were in place. I'll send an RFC >>> for #2 when I have managed to fix all the problems that it has found in the >>> OMPI layer... :-) >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> Index: opal/class/opal_object.h >>> =================================================================== >>> --- opal/class/opal_object.h (revision 28147) >>> +++ opal/class/opal_object.h (working copy) >>> @@ -169,7 +169,7 @@ >>> * @param NAME Name of the class to initialize >>> */ >>> #if OPAL_ENABLE_DEBUG >>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, >>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, >>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>> #else >>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 } >>> #endif >>> @@ -184,6 +184,10 @@ >>> /** Magic ID -- want this to be the very first item in the >>> struct's memory */ >>> uint64_t obj_magic_id; >>> + /* flag whether this was initialized using construct >>> + * versus obj_new >>> + */ >>> + bool constructed; >>> #endif >>> opal_class_t *obj_class; /**< class descriptor */ >>> volatile int32_t obj_reference_count; /**< reference count */ >>> @@ -252,6 +256,7 @@ >>> object->obj_magic_id = OPAL_OBJ_MAGIC_ID; >>> object->cls_init_file_name = file; >>> object->cls_init_lineno = line; >>> + object->constructed = false; >>> return object; >>> } >>> #define OBJ_NEW(type) \ >>> @@ -313,6 +318,8 @@ >>> assert(NULL != ((opal_object_t *) (object))->obj_class); \ >>> assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) >>> (object))->obj_magic_id); \ >>> if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \ >>> + /* constructed objects are not allowed to reach 0 */ \ >>> + assert(!(((opal_object_t *) (object))->constructed)); \ >>> OBJ_SET_MAGIC_ID((object), 0); \ >>> opal_obj_run_destructors((opal_object_t *) (object)); \ >>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>> @@ -344,6 +351,7 @@ >>> OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type)); \ >>> } while (0) >>> >>> +#if OPAL_ENABLE_DEBUG >>> #define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>> do { \ >>> OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>> @@ -352,11 +360,24 @@ >>> } \ >>> ((opal_object_t *) (object))->obj_class = (type); \ >>> ((opal_object_t *) (object))->obj_reference_count = 1; \ >>> + ((opal_object_t *) (object))->constructed = true; \ >>> opal_obj_run_constructors((opal_object_t *) (object)); \ >>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>> } while (0) >>> +#else >>> +#define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>> +do { \ >>> + OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>> + if (0 == (type)->cls_initialized) { \ >>> + opal_class_initialize((type)); \ >>> + } \ >>> + ((opal_object_t *) (object))->obj_class = (type); \ >>> + ((opal_object_t *) (object))->obj_reference_count = 1; \ >>> + opal_obj_run_constructors((opal_object_t *) (object)); \ >>> + OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>> +} while (0) >>> +#endif >>> >>> - >>> /** >>> * Destruct (finalize) an object that is not dynamically allocated. >>> * >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel