http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20101020/068fd427/pint-mgmt-opid-cleanups-0001.obj
Bart/Ben: Above is a link to the patch that was in the pvfs2-developers archive. It has alot more stuff in it than the patch that you just sent me. In addition, Sam made a comment about the pint-mgmt.c change indicating that memory would not be released unless other changes were made. So, I'm thinking that the changes I applied to OrangeFS included everything that Sam intended and not just a quick solution. Please take a look at this link and see if your code has these same changes. These changes may be contributing to your memory problem, but we also think there still may be something else as well. Becky On Thu, Jun 30, 2011 at 3:28 PM, Becky Ligon <[email protected]> wrote: > Thanks, Bart! I just found it in the pvfs2-developers archives as well!!!! > > It looks like I applied all of these patches to OrangeFS on 11/05/2010. Of > course, none of us realized that this patch would cause memory *not* to be > freed. So, I am still looking into what the solution should be! > > Becky > > > On Thu, Jun 30, 2011 at 3:20 PM, Bart Taylor <[email protected]> wrote: > >> Here you go Becky. I got this patch from Sam on 10/20/2010. >> >> Bart. >> >> >> >> >> On Thu, Jun 30, 2011 at 1:11 PM, Becky Ligon <[email protected]> wrote: >> >>> Ben: >>> >>> Do you have a copy of the original fix, "pint-mgmt-segv-fix.patch" that >>> Sam sent to you? If so, can you send it to me? >>> >>> Thanks, >>> Becky >>> >>> >>> On Thu, Jun 30, 2011 at 12:08 PM, Becky Ligon <[email protected]>wrote: >>> >>>> I think I made these changes due to a problem that we were having on >>>> our Palmetto system. Just a quick look leads me to believe that the change >>>> to PINT_context_test_all should not have freed the op_id; I suspect that >>>> this may prevent a cleanup of it later down the road. I will look at it >>>> closer to see if that is the case. The change to PINT_context_test is >>>> okay, >>>> I think. >>>> >>>> Becky >>>> >>>> >>>> On Thu, Jun 30, 2011 at 11:38 AM, Becky Ligon <[email protected]>wrote: >>>> >>>>> I remember making this change but I'm not certain if I found the >>>>> problem or was asked to add this to the repository. Seems like Sam Lang >>>>> was >>>>> helping someone with a problem like this and then I patched it. In any >>>>> case, I'll track it down and see what I can do. >>>>> >>>>> Becky >>>>> >>>>> >>>>> On Thu, Jun 30, 2011 at 11:20 AM, Benjamin Severs < >>>>> [email protected]> wrote: >>>>> >>>>>> On Wednesday, June 29, 2011 08:49:46 Michael Moore wrote: >>>>>> > Bart, >>>>>> > >>>>>> > This doesn't look like a leak in the "can't free memory because >>>>>> there isn't >>>>>> > a reference to it" sense. However, there is definitely memory usage >>>>>> growth, >>>>>> > although, I haven't been able to replicate usage with quite as steep >>>>>> a >>>>>> > curve as your graph. That graph was based on system memory usage? >>>>>> Would >>>>>> > you mind running the same test and collecting the output from >>>>>> > /proc/<PID>/statm at some interval while the test is running where >>>>>> PID is >>>>>> > the server process' PID? Heap profiling so far makes it look like >>>>>> the >>>>>> > increased usage is coming from libraries not due to memory >>>>>> allocations >>>>>> > within OrangeFS but no definitive answer yet. >>>>>> > >>>>>> > Has anyone seen this behavior in the past? >>>>>> > >>>>>> > Thanks, >>>>>> > Michael >>>>>> > >>>>>> >>>>>> I believe I tracked the commit down which introduced the leak. I can >>>>>> revert >>>>>> this commit and the leak appears to go away. I haven't yet figured >>>>>> out what is >>>>>> actually leaking though. >>>>>> >>>>>> The commit is: >>>>>> Author: bligon <bligon> >>>>>> Date: Fri Nov 5 16:09:27 2010 +0000 >>>>>> >>>>>> Corrects problem where the server tries to cancel a timed-out i/o >>>>>> operation but >>>>>> before the cancel can happen, the operation finishes; so, the >>>>>> cancel can't >>>>>> find the op id and subsequently caused a segfault.' >>>>>> >>>>>> >>>>>> diff --git a/src/common/mgmt/pint-context.c >>>>>> b/src/common/mgmt/pint-context.c >>>>>> index 0a07256..cda038c 100644 >>>>>> --- a/src/common/mgmt/pint-context.c >>>>>> +++ b/src/common/mgmt/pint-context.c >>>>>> @@ -422,10 +422,17 @@ int PINT_context_test_all(PINT_context_id >>>>>> context_id, >>>>>> >>>>>> for(; i < *count; ++i) >>>>>> { >>>>>> + struct PINT_op_entry *op_entry; >>>>>> qentry = user_ptrs[i]; >>>>>> ctx_entry = PINT_queue_entry_object( >>>>>> qentry, struct PINT_context_queue_entry, qentry); >>>>>> op_ids[i] = ctx_entry->op_id; >>>>>> + op_entry = id_gen_safe_lookup(ctx_entry->op_id); >>>>>> + if (op_entry) >>>>>> + { >>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>> + free(op_entry); >>>>>> + } >>>>>> uptr = ctx_entry->user_ptr; >>>>>> errors[i] = ctx_entry->result; >>>>>> user_ptrs[i] = uptr; >>>>>> @@ -519,11 +526,19 @@ int PINT_context_test(PINT_context_id >>>>>> context_id, >>>>>> &op_id, &qentry, microsecs); >>>>>> if(ret == 0) >>>>>> { >>>>>> + struct PINT_op_entry *op_entry; >>>>>> entry = PINT_queue_entry_object( >>>>>> qentry, >>>>>> struct PINT_context_queue_entry, qentry); >>>>>> + op_entry = id_gen_safe_lookup(entry->op_id); >>>>>> + if (op_entry) >>>>>> + { >>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>> + free(op_entry); >>>>>> + } >>>>>> *user_ptr = entry->user_ptr; >>>>>> *error = entry->result; >>>>>> + >>>>>> free(entry); >>>>>> } >>>>>> >>>>>> diff --git a/src/common/mgmt/pint-context.h >>>>>> b/src/common/mgmt/pint-context.h >>>>>> index 78dfb4b..147be5c 100644 >>>>>> --- a/src/common/mgmt/pint-context.h >>>>>> +++ b/src/common/mgmt/pint-context.h >>>>>> @@ -9,6 +9,7 @@ >>>>>> >>>>>> #include "pint-op.h" >>>>>> #include "pint-queue.h" >>>>>> +#include "quickhash.h" >>>>>> >>>>>> enum PINT_context_type >>>>>> { >>>>>> @@ -18,6 +19,19 @@ enum PINT_context_type >>>>>> >>>>>> typedef PVFS_id_gen_t PINT_context_id; >>>>>> >>>>>> +struct PINT_op_entry >>>>>> +{ >>>>>> + void *user_ptr; >>>>>> + PINT_operation_t op; >>>>>> + PVFS_id_gen_t wq_id; >>>>>> + PINT_context_id ctx_id; >>>>>> + PVFS_error error; >>>>>> + >>>>>> + struct qhash_head link; >>>>>> +}; >>>>>> + >>>>>> + >>>>>> + >>>>>> typedef int (*PINT_completion_callback)(PINT_context_id ctx_id, >>>>>> int count, >>>>>> PINT_op_id *op_ids, >>>>>> diff --git a/src/common/mgmt/pint-mgmt.c b/src/common/mgmt/pint-mgmt.c >>>>>> index fa52d8d..120a5a1 100644 >>>>>> --- a/src/common/mgmt/pint-mgmt.c >>>>>> +++ b/src/common/mgmt/pint-mgmt.c >>>>>> @@ -42,17 +42,6 @@ struct PINT_worker_s >>>>>> struct qlist_head link; >>>>>> }; >>>>>> >>>>>> -struct PINT_op_entry >>>>>> -{ >>>>>> - void *user_ptr; >>>>>> - PINT_operation_t op; >>>>>> - PVFS_id_gen_t wq_id; >>>>>> - PINT_context_id ctx_id; >>>>>> - PVFS_error error; >>>>>> - >>>>>> - struct qhash_head link; >>>>>> -}; >>>>>> - >>>>>> struct PINT_manager_s >>>>>> { >>>>>> PINT_context_id context; >>>>>> @@ -634,12 +623,15 @@ int PINT_manager_ctx_post(PINT_manager_t >>>>>> manager, >>>>>> return ret; >>>>>> } >>>>>> >>>>>> - id_gen_fast_register(&op_entry->op.id, op_entry); >>>>>> - >>>>>> gossip_debug(GOSSIP_MGMT_DEBUG, >>>>>> "[MGMT]: manager ops: adding op id: %llu\n", >>>>>> llu(op_entry->op.id)); >>>>>> gen_mutex_lock(&manager->mutex); >>>>>> + ret = id_gen_safe_register(&op_entry->op.id, op_entry); >>>>>> + if (ret < 0) >>>>>> + { >>>>>> + return ret; >>>>>> + } >>>>>> qhash_add(manager->ops, &op_entry->op.id, &op_entry->link); >>>>>> manager->op_count++; >>>>>> gen_mutex_unlock(&manager->mutex); >>>>>> @@ -657,6 +649,7 @@ int PINT_manager_ctx_post(PINT_manager_t manager, >>>>>> * operation and return the error >>>>>> */ >>>>>> gen_mutex_lock(&manager->mutex); >>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>> qhash_search_and_remove(manager->ops, &op_entry->op.id); >>>>>> gen_mutex_unlock(&manager->mutex); >>>>>> >>>>>> @@ -1047,7 +1040,7 @@ int PINT_manager_test_op(PINT_manager_t manager, >>>>>> struct PINT_op_entry *entry; >>>>>> PINT_context_id context; >>>>>> >>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>> if(!entry) >>>>>> { >>>>>> return -PVFS_EINVAL; >>>>>> @@ -1143,13 +1136,17 @@ int PINT_manager_cancel(PINT_manager_t >>>>>> manager, >>>>>> PINT_queue_id, >>>>>> PINT_operation_t *); >>>>>> >>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>> if(!entry) >>>>>> { >>>>>> return -PVFS_EINVAL; >>>>>> } >>>>>> ret = PINT_manager_find_worker( >>>>>> manager, NULL, NULL, NULL, entry->wq_id, &worker, &queue_id); >>>>>> + if (ret != 0) >>>>>> + { >>>>>> + return ret; >>>>>> + } >>>>>> context = entry->ctx_id; >>>>>> op = &entry->op; >>>>>> >>>>>> @@ -1215,7 +1212,7 @@ int PINT_manager_wait_op(PINT_manager_t manager, >>>>>> int msecs_remaining; >>>>>> struct timeval last, now; >>>>>> >>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>> if(!entry) >>>>>> { >>>>>> return -PVFS_EINVAL; >>>>>> @@ -1451,9 +1448,6 @@ int PINT_manager_complete_op(PINT_manager_t >>>>>> manager, >>>>>> * since blocking calls never add the op >>>>>> * to the table >>>>>> */ >>>>>> - >>>>>> - /* free the op entry */ >>>>>> - free(entry); >>>>>> return ret; >>>>>> } >>>>>> >>>>>> >>>>>> -- >>>>>> Benjamin Severs >>>>>> _______________________________________________ >>>>>> Pvfs2-developers mailing list >>>>>> [email protected] >>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Becky Ligon >>>>> OrangeFS Support and Development >>>>> Omnibond Systems >>>>> Anderson, South Carolina >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Becky Ligon >>>> OrangeFS Support and Development >>>> Omnibond Systems >>>> Anderson, South Carolina >>>> >>>> >>>> >>> >>> >>> -- >>> Becky Ligon >>> OrangeFS Support and Development >>> Omnibond Systems >>> Anderson, South Carolina >>> >>> >>> >>> _______________________________________________ >>> Pvfs2-developers mailing list >>> [email protected] >>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers >>> >>> >> > > > -- > Becky Ligon > OrangeFS Support and Development > Omnibond Systems > Anderson, South Carolina > > > -- Becky Ligon OrangeFS Support and Development Omnibond Systems Anderson, South Carolina
_______________________________________________ Pvfs2-developers mailing list [email protected] http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
