On Mar 8, 2013, at 15:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Isn't that what assert()'s are for? :-) > > The point is that for Ralph's case, the #2 assert would cause a notification > *right at the point of the error* (i.e., the errant DESTRUCT). The other > tools you mention all cause notifications long after the DESTRUCT (i.e., when > the freed memory is used, at which time it may be totally jumbled/corrupt/the > FILE/line info gone). You know how hard it can be to track down memory > corruption; having a notification right at the point of the error is a Good > Thing. This is indeed true. However, a tool like valgrind keep trace of where the memory was allocated and freed, so this might provide enough info to identify and fix the issue. > Don't forget that we have lots of other asserts in the OBJ system (in the > debug case only, of course). > > And per my other mail, if I find legitimate cases where the destructors are > invoked where refcount!=1, we'll figure something else out. I am not arguing about the validity of the assert, but about the fact that now we can't just OBJ_DESTRUCT something, we are required to first set it's refcount to 1. I have a more advanced use case for you. Based on the MPI standard, MPI_Finalize can be called while the user still has non-complete requests returned by any of the non-blocking calls (there are some drawbacks of course, but this is not specifically prohibited). Thus, these requests will not have a ref count to 1, so they will not be able to be destructed. This is exactly what our code is doing today: pml_base_close.c:58: OBJ_DESTRUCT(&mca_pml_base_send_requests); pml_base_close.c:59: OBJ_DESTRUCT(&mca_pml_base_recv_requests); and then ompi_freelist.c:86. George. > > > > On Mar 8, 2013, at 7:22 AM, George Bosilca <bosi...@icl.utk.edu> wrote: > >> I'm sorry Ralph, I'm puzzled by your approach. You knowingly use a broken >> example to justify a patch that under correct/consistent usage solve a >> non-issue? >> >> Why can't you use any decent memory debugger (fence, purify or valgrind) to >> identify the issue you describe below? >> >> George. >> >> On Mar 8, 2013, at 02:19 , Ralph Castain <r...@open-mpi.org> wrote: >> >>> >>> On Mar 7, 2013, at 4:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote: >>> >>>> Please refrain from doing so, the assumption #1 this patch is based on is >>>> false. First, OBJ_CONSTRUCT can be run to construct a specific type of >>>> object in a preallocated memory region (not only on the stack or heap). In >>>> fact, it is the only way we can dynamically initialize an object in a >>>> memory allocated outside the OBJ_NEW system. Second, OBJ_CONSTRUCT is not >>>> necessarily matched with an OBJ_DESTRUCT, it work just fine with >>>> OBJ_RELEASE. In fact I use these feature in several places. >>>> >>>> An example will be a memory region without a predefined size, that I >>>> manipulate as opal_list_item_t. This fragment gets allocated when it's >>>> size is know, then gets OBJ_CONSTRUCT'ed and then used. The reference >>>> count is playing its role, when nobody is using the object anymore, it >>>> will be automatically released. With the change you propose such usage >>>> will be prohibited. >>>> >>>> The feature you are looking for, the one that might have shorten Ralph's >>>> debugging time, is already in the opal_object_t. One should use the >>>> cls_init_file_name and cls_init_lineno fields to see where the object was >>>> first initialized as these fields are set either by the OBJ_NEW or by the >>>> OBJ_CONSTRUCT call. >>> >>> Not exactly. Consider the case where we have a library - e.g., ORTE. Down >>> in the library, perhaps several function calls down, we receive a pointer >>> to an object. The library, not knowing any better, uses OBJ_RETAIN to >>> indicate that this object is being used and therefore should not be >>> released. It then returns and allows an async procedure to run. >>> >>> Up above, one caller to the library uses OBJ_NEW to create the object. >>> Thus, the reference count system is in-play and governs when the data goes >>> away. >>> >>> However, in another place, the caller uses OBJ_CONSTRUCT to initialize the >>> object, and OBJ_DESTRUCTs it when the high-level call returns. In this >>> case, the reference count system is ignored - OBJ_DESTRUCT destroys the >>> object regardless of the reference count. So suddenly the library is >>> working with garbage, with no way to know that it happened. >>> >>> So now let's be specific to see how your suggestion doesn't solve the >>> problem (I actually had tried it). Consider the OOB operating >>> asynchronously. In the first case, where the opal_buffer_t is created via >>> OBJ_NEW, we can point the message system at the data field in the buffer >>> and just OBJ_RETAIN it. >>> >>> However, in the second case, the OBJ_RETAIN won't work - the calling code >>> releases the data area, but the OOB has no idea that happened. So the >>> pointer to the data field actually isn't valid any more, but there is no >>> way to detect it. Likewise, looking for opal_object_t info is useless as >>> the fields no longer exist. >>> >>> Yes, I realize this is an incorrect program - the caller isn't allowed to >>> release data until the async operation concludes. But the fact that a >>> library function, especially one that is hidden down low in the code base, >>> is async may not be immediately apparent. The resulting "bug" is extremely >>> hard to chase down, especially as it almost inevitably is exposed as a race >>> condition. >>> >>> The goal of this RFC was to make such problems immediately apparent. >>> Perhaps one solution would be to get rid of OBJ_DESTRUCT and just have >>> everyone use OBJ_RELEASE, augmented with a flag to indicate whether or not >>> to free the backing memory. Another might be to have OBJ_DESTRUCT respect >>> ref counts, but Jeff, Brian and I didn't like that one. The proposed >>> solution was another way to approach it that would force the above scenario >>> to be recognized only during debug builds, thus allowing it to be >>> identified and corrected. >>> >>> Your example use-case is certainly something I hadn't considered - so >>> perhaps we'll have to think of another way to detect my situation while >>> still allowing what you do, or perhaps add a new OBJ_CONSTRUCT_WITH_RETAIN >>> (or whatever) macro for use in your use-case that corrects the ref count? >>> >>> >>>> >>>> George. >>>> >>>> PS: The second patch (ref count == 1 in OBJ_DESTRUCT) is trivial but >>>> reasonable. >>>> >>>> On Mar 7, 2013, at 22:10 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> >>>> wrote: >>>> >>>>> WHAT: Simple patch that will fail an assert() if you OBJ_CONSTRUCT an >>>>> object and its ref count goes to 0 (in debug builds only). >>>>> >>>>> WHY: To catch bugs. >>>>> >>>>> WHERE: opal/class/opal_class.h >>>>> >>>>> WHEN: Timeout of 1 week -- COB, Thurs, 14 Mar, 2013 >>>>> >>>>> MORE DETAIL: >>>>> >>>>> On the call on Tuesday, we talked about some ideas for catching bugs with >>>>> respect to object reference counting. After the call, Brian, Ralph, and >>>>> I came up with two useful asserts to help catch bugs (in debug builds >>>>> only): >>>>> >>>>> 1. If you OBJ_CONSTRUCT an object, its ref count should never go to zero. >>>>> 2. When a object is destroyed, its refcount should be exactly 1. >>>>> >>>>> This RFC is for #1. The attached patch doesn't seem to cause any >>>>> problems (and we didn't expect it to). But it's a good addition to the >>>>> other asserts() that are already in the object code already. >>>>> >>>>> As for #2, Ralph has previously found bugs in the ORTE layer that would >>>>> have been much easier to track down if #2 were in place. I'll send an >>>>> RFC for #2 when I have managed to fix all the problems that it has found >>>>> in the OMPI layer... :-) >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> jsquy...@cisco.com >>>>> For corporate legal information go to: >>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>> >>>>> Index: opal/class/opal_object.h >>>>> =================================================================== >>>>> --- opal/class/opal_object.h (revision 28147) >>>>> +++ opal/class/opal_object.h (working copy) >>>>> @@ -169,7 +169,7 @@ >>>>> * @param NAME Name of the class to initialize >>>>> */ >>>>> #if OPAL_ENABLE_DEBUG >>>>> -#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, >>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>>>> +#define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OPAL_OBJ_MAGIC_ID, 1, >>>>> OBJ_CLASS(BASE_CLASS), 1, __FILE__, __LINE__ } >>>>> #else >>>>> #define OPAL_OBJ_STATIC_INIT(BASE_CLASS) { OBJ_CLASS(BASE_CLASS), 1 } >>>>> #endif >>>>> @@ -184,6 +184,10 @@ >>>>> /** Magic ID -- want this to be the very first item in the >>>>> struct's memory */ >>>>> uint64_t obj_magic_id; >>>>> + /* flag whether this was initialized using construct >>>>> + * versus obj_new >>>>> + */ >>>>> + bool constructed; >>>>> #endif >>>>> opal_class_t *obj_class; /**< class descriptor */ >>>>> volatile int32_t obj_reference_count; /**< reference count */ >>>>> @@ -252,6 +256,7 @@ >>>>> object->obj_magic_id = OPAL_OBJ_MAGIC_ID; >>>>> object->cls_init_file_name = file; >>>>> object->cls_init_lineno = line; >>>>> + object->constructed = false; >>>>> return object; >>>>> } >>>>> #define OBJ_NEW(type) \ >>>>> @@ -313,6 +318,8 @@ >>>>> assert(NULL != ((opal_object_t *) (object))->obj_class); \ >>>>> assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) >>>>> (object))->obj_magic_id); \ >>>>> if (0 == opal_obj_update((opal_object_t *) (object), -1)) { \ >>>>> + /* constructed objects are not allowed to reach 0 */ \ >>>>> + assert(!(((opal_object_t *) (object))->constructed)); \ >>>>> OBJ_SET_MAGIC_ID((object), 0); \ >>>>> opal_obj_run_destructors((opal_object_t *) (object)); \ >>>>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>>> @@ -344,6 +351,7 @@ >>>>> OBJ_CONSTRUCT_INTERNAL((object), OBJ_CLASS(type)); \ >>>>> } while (0) >>>>> >>>>> +#if OPAL_ENABLE_DEBUG >>>>> #define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>>>> do { \ >>>>> OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>>>> @@ -352,11 +360,24 @@ >>>>> } \ >>>>> ((opal_object_t *) (object))->obj_class = (type); \ >>>>> ((opal_object_t *) (object))->obj_reference_count = 1; \ >>>>> + ((opal_object_t *) (object))->constructed = true; \ >>>>> opal_obj_run_constructors((opal_object_t *) (object)); \ >>>>> OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>>> } while (0) >>>>> +#else >>>>> +#define OBJ_CONSTRUCT_INTERNAL(object, type) \ >>>>> +do { \ >>>>> + OBJ_SET_MAGIC_ID((object), OPAL_OBJ_MAGIC_ID); \ >>>>> + if (0 == (type)->cls_initialized) { \ >>>>> + opal_class_initialize((type)); \ >>>>> + } \ >>>>> + ((opal_object_t *) (object))->obj_class = (type); \ >>>>> + ((opal_object_t *) (object))->obj_reference_count = 1; \ >>>>> + opal_obj_run_constructors((opal_object_t *) (object)); \ >>>>> + OBJ_REMEMBER_FILE_AND_LINENO( object, __FILE__, __LINE__ ); \ >>>>> +} while (0) >>>>> +#endif >>>>> >>>>> - >>>>> /** >>>>> * Destruct (finalize) an object that is not dynamically allocated. >>>>> * >>>>> >>>>> >>>>> _______________________________________________ >>>>> devel mailing list >>>>> de...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel