Just as a side note, it appears that this bug fix was never commited to the repository at all, not even the 2.8.2 branch.
Becky On Thu, Jun 30, 2011 at 4:42 PM, Becky Ligon <[email protected]> wrote: > > http://www.beowulf-underground.org/pipermail/pvfs2-developers/attachments/20101020/068fd427/pint-mgmt-opid-cleanups-0001.obj > > Bart/Ben: > > Above is a link to the patch that was in the pvfs2-developers archive. It > has alot more stuff in it than the patch that you just sent me. In > addition, Sam made a comment about the pint-mgmt.c change indicating that > memory would not be released unless other changes were made. So, I'm > thinking that the changes I applied to OrangeFS included everything that Sam > intended and not just a quick solution. Please take a look at this link and > see if your code has these same changes. These changes may be contributing > to your memory problem, but we also think there still may be something else > as well. > > Becky > > > On Thu, Jun 30, 2011 at 3:28 PM, Becky Ligon <[email protected]> wrote: > >> Thanks, Bart! I just found it in the pvfs2-developers archives as >> well!!!! >> >> It looks like I applied all of these patches to OrangeFS on 11/05/2010. >> Of course, none of us realized that this patch would cause memory *not* to >> be freed. So, I am still looking into what the solution should be! >> >> Becky >> >> >> On Thu, Jun 30, 2011 at 3:20 PM, Bart Taylor <[email protected]> wrote: >> >>> Here you go Becky. I got this patch from Sam on 10/20/2010. >>> >>> Bart. >>> >>> >>> >>> >>> On Thu, Jun 30, 2011 at 1:11 PM, Becky Ligon <[email protected]> wrote: >>> >>>> Ben: >>>> >>>> Do you have a copy of the original fix, "pint-mgmt-segv-fix.patch" that >>>> Sam sent to you? If so, can you send it to me? >>>> >>>> Thanks, >>>> Becky >>>> >>>> >>>> On Thu, Jun 30, 2011 at 12:08 PM, Becky Ligon <[email protected]>wrote: >>>> >>>>> I think I made these changes due to a problem that we were having on >>>>> our Palmetto system. Just a quick look leads me to believe that the >>>>> change >>>>> to PINT_context_test_all should not have freed the op_id; I suspect that >>>>> this may prevent a cleanup of it later down the road. I will look at it >>>>> closer to see if that is the case. The change to PINT_context_test is >>>>> okay, >>>>> I think. >>>>> >>>>> Becky >>>>> >>>>> >>>>> On Thu, Jun 30, 2011 at 11:38 AM, Becky Ligon <[email protected]>wrote: >>>>> >>>>>> I remember making this change but I'm not certain if I found the >>>>>> problem or was asked to add this to the repository. Seems like Sam Lang >>>>>> was >>>>>> helping someone with a problem like this and then I patched it. In any >>>>>> case, I'll track it down and see what I can do. >>>>>> >>>>>> Becky >>>>>> >>>>>> >>>>>> On Thu, Jun 30, 2011 at 11:20 AM, Benjamin Severs < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> On Wednesday, June 29, 2011 08:49:46 Michael Moore wrote: >>>>>>> > Bart, >>>>>>> > >>>>>>> > This doesn't look like a leak in the "can't free memory because >>>>>>> there isn't >>>>>>> > a reference to it" sense. However, there is definitely memory usage >>>>>>> growth, >>>>>>> > although, I haven't been able to replicate usage with quite as >>>>>>> steep a >>>>>>> > curve as your graph. That graph was based on system memory usage? >>>>>>> Would >>>>>>> > you mind running the same test and collecting the output from >>>>>>> > /proc/<PID>/statm at some interval while the test is running where >>>>>>> PID is >>>>>>> > the server process' PID? Heap profiling so far makes it look like >>>>>>> the >>>>>>> > increased usage is coming from libraries not due to memory >>>>>>> allocations >>>>>>> > within OrangeFS but no definitive answer yet. >>>>>>> > >>>>>>> > Has anyone seen this behavior in the past? >>>>>>> > >>>>>>> > Thanks, >>>>>>> > Michael >>>>>>> > >>>>>>> >>>>>>> I believe I tracked the commit down which introduced the leak. I can >>>>>>> revert >>>>>>> this commit and the leak appears to go away. I haven't yet figured >>>>>>> out what is >>>>>>> actually leaking though. >>>>>>> >>>>>>> The commit is: >>>>>>> Author: bligon <bligon> >>>>>>> Date: Fri Nov 5 16:09:27 2010 +0000 >>>>>>> >>>>>>> Corrects problem where the server tries to cancel a timed-out i/o >>>>>>> operation but >>>>>>> before the cancel can happen, the operation finishes; so, the >>>>>>> cancel can't >>>>>>> find the op id and subsequently caused a segfault.' >>>>>>> >>>>>>> >>>>>>> diff --git a/src/common/mgmt/pint-context.c >>>>>>> b/src/common/mgmt/pint-context.c >>>>>>> index 0a07256..cda038c 100644 >>>>>>> --- a/src/common/mgmt/pint-context.c >>>>>>> +++ b/src/common/mgmt/pint-context.c >>>>>>> @@ -422,10 +422,17 @@ int PINT_context_test_all(PINT_context_id >>>>>>> context_id, >>>>>>> >>>>>>> for(; i < *count; ++i) >>>>>>> { >>>>>>> + struct PINT_op_entry *op_entry; >>>>>>> qentry = user_ptrs[i]; >>>>>>> ctx_entry = PINT_queue_entry_object( >>>>>>> qentry, struct PINT_context_queue_entry, qentry); >>>>>>> op_ids[i] = ctx_entry->op_id; >>>>>>> + op_entry = id_gen_safe_lookup(ctx_entry->op_id); >>>>>>> + if (op_entry) >>>>>>> + { >>>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>>> + free(op_entry); >>>>>>> + } >>>>>>> uptr = ctx_entry->user_ptr; >>>>>>> errors[i] = ctx_entry->result; >>>>>>> user_ptrs[i] = uptr; >>>>>>> @@ -519,11 +526,19 @@ int PINT_context_test(PINT_context_id >>>>>>> context_id, >>>>>>> &op_id, &qentry, microsecs); >>>>>>> if(ret == 0) >>>>>>> { >>>>>>> + struct PINT_op_entry *op_entry; >>>>>>> entry = PINT_queue_entry_object( >>>>>>> qentry, >>>>>>> struct PINT_context_queue_entry, qentry); >>>>>>> + op_entry = id_gen_safe_lookup(entry->op_id); >>>>>>> + if (op_entry) >>>>>>> + { >>>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>>> + free(op_entry); >>>>>>> + } >>>>>>> *user_ptr = entry->user_ptr; >>>>>>> *error = entry->result; >>>>>>> + >>>>>>> free(entry); >>>>>>> } >>>>>>> >>>>>>> diff --git a/src/common/mgmt/pint-context.h >>>>>>> b/src/common/mgmt/pint-context.h >>>>>>> index 78dfb4b..147be5c 100644 >>>>>>> --- a/src/common/mgmt/pint-context.h >>>>>>> +++ b/src/common/mgmt/pint-context.h >>>>>>> @@ -9,6 +9,7 @@ >>>>>>> >>>>>>> #include "pint-op.h" >>>>>>> #include "pint-queue.h" >>>>>>> +#include "quickhash.h" >>>>>>> >>>>>>> enum PINT_context_type >>>>>>> { >>>>>>> @@ -18,6 +19,19 @@ enum PINT_context_type >>>>>>> >>>>>>> typedef PVFS_id_gen_t PINT_context_id; >>>>>>> >>>>>>> +struct PINT_op_entry >>>>>>> +{ >>>>>>> + void *user_ptr; >>>>>>> + PINT_operation_t op; >>>>>>> + PVFS_id_gen_t wq_id; >>>>>>> + PINT_context_id ctx_id; >>>>>>> + PVFS_error error; >>>>>>> + >>>>>>> + struct qhash_head link; >>>>>>> +}; >>>>>>> + >>>>>>> + >>>>>>> + >>>>>>> typedef int (*PINT_completion_callback)(PINT_context_id ctx_id, >>>>>>> int count, >>>>>>> PINT_op_id *op_ids, >>>>>>> diff --git a/src/common/mgmt/pint-mgmt.c >>>>>>> b/src/common/mgmt/pint-mgmt.c >>>>>>> index fa52d8d..120a5a1 100644 >>>>>>> --- a/src/common/mgmt/pint-mgmt.c >>>>>>> +++ b/src/common/mgmt/pint-mgmt.c >>>>>>> @@ -42,17 +42,6 @@ struct PINT_worker_s >>>>>>> struct qlist_head link; >>>>>>> }; >>>>>>> >>>>>>> -struct PINT_op_entry >>>>>>> -{ >>>>>>> - void *user_ptr; >>>>>>> - PINT_operation_t op; >>>>>>> - PVFS_id_gen_t wq_id; >>>>>>> - PINT_context_id ctx_id; >>>>>>> - PVFS_error error; >>>>>>> - >>>>>>> - struct qhash_head link; >>>>>>> -}; >>>>>>> - >>>>>>> struct PINT_manager_s >>>>>>> { >>>>>>> PINT_context_id context; >>>>>>> @@ -634,12 +623,15 @@ int PINT_manager_ctx_post(PINT_manager_t >>>>>>> manager, >>>>>>> return ret; >>>>>>> } >>>>>>> >>>>>>> - id_gen_fast_register(&op_entry->op.id, op_entry); >>>>>>> - >>>>>>> gossip_debug(GOSSIP_MGMT_DEBUG, >>>>>>> "[MGMT]: manager ops: adding op id: %llu\n", >>>>>>> llu(op_entry->op.id)); >>>>>>> gen_mutex_lock(&manager->mutex); >>>>>>> + ret = id_gen_safe_register(&op_entry->op.id, op_entry); >>>>>>> + if (ret < 0) >>>>>>> + { >>>>>>> + return ret; >>>>>>> + } >>>>>>> qhash_add(manager->ops, &op_entry->op.id, &op_entry->link); >>>>>>> manager->op_count++; >>>>>>> gen_mutex_unlock(&manager->mutex); >>>>>>> @@ -657,6 +649,7 @@ int PINT_manager_ctx_post(PINT_manager_t manager, >>>>>>> * operation and return the error >>>>>>> */ >>>>>>> gen_mutex_lock(&manager->mutex); >>>>>>> + id_gen_safe_unregister(op_entry->op.id); >>>>>>> qhash_search_and_remove(manager->ops, &op_entry->op.id); >>>>>>> gen_mutex_unlock(&manager->mutex); >>>>>>> >>>>>>> @@ -1047,7 +1040,7 @@ int PINT_manager_test_op(PINT_manager_t >>>>>>> manager, >>>>>>> struct PINT_op_entry *entry; >>>>>>> PINT_context_id context; >>>>>>> >>>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>>> if(!entry) >>>>>>> { >>>>>>> return -PVFS_EINVAL; >>>>>>> @@ -1143,13 +1136,17 @@ int PINT_manager_cancel(PINT_manager_t >>>>>>> manager, >>>>>>> PINT_queue_id, >>>>>>> PINT_operation_t *); >>>>>>> >>>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>>> if(!entry) >>>>>>> { >>>>>>> return -PVFS_EINVAL; >>>>>>> } >>>>>>> ret = PINT_manager_find_worker( >>>>>>> manager, NULL, NULL, NULL, entry->wq_id, &worker, &queue_id); >>>>>>> + if (ret != 0) >>>>>>> + { >>>>>>> + return ret; >>>>>>> + } >>>>>>> context = entry->ctx_id; >>>>>>> op = &entry->op; >>>>>>> >>>>>>> @@ -1215,7 +1212,7 @@ int PINT_manager_wait_op(PINT_manager_t >>>>>>> manager, >>>>>>> int msecs_remaining; >>>>>>> struct timeval last, now; >>>>>>> >>>>>>> - entry = id_gen_fast_lookup(op_id); >>>>>>> + entry = id_gen_safe_lookup(op_id); >>>>>>> if(!entry) >>>>>>> { >>>>>>> return -PVFS_EINVAL; >>>>>>> @@ -1451,9 +1448,6 @@ int PINT_manager_complete_op(PINT_manager_t >>>>>>> manager, >>>>>>> * since blocking calls never add the op >>>>>>> * to the table >>>>>>> */ >>>>>>> - >>>>>>> - /* free the op entry */ >>>>>>> - free(entry); >>>>>>> return ret; >>>>>>> } >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Benjamin Severs >>>>>>> _______________________________________________ >>>>>>> Pvfs2-developers mailing list >>>>>>> [email protected] >>>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Becky Ligon >>>>>> OrangeFS Support and Development >>>>>> Omnibond Systems >>>>>> Anderson, South Carolina >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Becky Ligon >>>>> OrangeFS Support and Development >>>>> Omnibond Systems >>>>> Anderson, South Carolina >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Becky Ligon >>>> OrangeFS Support and Development >>>> Omnibond Systems >>>> Anderson, South Carolina >>>> >>>> >>>> >>>> _______________________________________________ >>>> Pvfs2-developers mailing list >>>> [email protected] >>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers >>>> >>>> >>> >> >> >> -- >> Becky Ligon >> OrangeFS Support and Development >> Omnibond Systems >> Anderson, South Carolina >> >> >> > > > -- > Becky Ligon > OrangeFS Support and Development > Omnibond Systems > Anderson, South Carolina > > > -- Becky Ligon OrangeFS Support and Development Omnibond Systems Anderson, South Carolina
_______________________________________________ Pvfs2-developers mailing list [email protected] http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
