Thanks, Bart!  I just found it in the pvfs2-developers archives as well!!!!

It looks like I applied all of these patches to OrangeFS on 11/05/2010.  Of
course, none of us realized that this patch would cause memory *not* to be
freed.  So, I am still looking into what the solution should be!

Becky

On Thu, Jun 30, 2011 at 3:20 PM, Bart Taylor <[email protected]> wrote:

> Here you go Becky. I got this patch from Sam on 10/20/2010.
>
> Bart.
>
>
>
>
> On Thu, Jun 30, 2011 at 1:11 PM, Becky Ligon <[email protected]> wrote:
>
>> Ben:
>>
>> Do you have a copy of the original fix, "pint-mgmt-segv-fix.patch" that
>> Sam sent to you?  If so, can you send it to me?
>>
>> Thanks,
>> Becky
>>
>>
>> On Thu, Jun 30, 2011 at 12:08 PM, Becky Ligon <[email protected]> wrote:
>>
>>> I think I made these changes due to a problem  that we were having on our
>>> Palmetto system.  Just a quick look leads me to believe that the change to
>>> PINT_context_test_all should not have freed the op_id; I suspect that this
>>> may prevent a cleanup of it later down the road.  I will look at it closer
>>> to see if that is the case.  The change to PINT_context_test is okay, I
>>> think.
>>>
>>> Becky
>>>
>>>
>>> On Thu, Jun 30, 2011 at 11:38 AM, Becky Ligon <[email protected]>wrote:
>>>
>>>> I remember making this change but I'm not certain if I found the problem
>>>> or was asked to add this to the repository.  Seems like Sam Lang was 
>>>> helping
>>>> someone with a problem like this and then I patched it.  In any case, I'll
>>>> track it down and see what I can do.
>>>>
>>>> Becky
>>>>
>>>>
>>>> On Thu, Jun 30, 2011 at 11:20 AM, Benjamin Severs <
>>>> [email protected]> wrote:
>>>>
>>>>> On Wednesday, June 29, 2011 08:49:46 Michael Moore wrote:
>>>>> > Bart,
>>>>> >
>>>>> > This doesn't look like a leak in the "can't free memory because there
>>>>> isn't
>>>>> > a reference to it" sense. However, there is definitely memory usage
>>>>> growth,
>>>>> > although, I haven't been able to replicate usage with quite as steep
>>>>> a
>>>>> > curve as your graph. That graph was based on system memory usage?
>>>>> Would
>>>>> > you mind running the same test and collecting the output from
>>>>> > /proc/<PID>/statm at some interval while the test is running where
>>>>> PID is
>>>>> > the server process' PID? Heap profiling so far makes it look like the
>>>>> > increased usage is coming from libraries not due to memory
>>>>> allocations
>>>>> > within OrangeFS but no definitive answer yet.
>>>>> >
>>>>> > Has anyone seen this behavior in the past?
>>>>> >
>>>>> > Thanks,
>>>>> > Michael
>>>>> >
>>>>>
>>>>> I believe I tracked the commit down which introduced the leak.  I can
>>>>> revert
>>>>> this commit and the leak appears to go away.  I haven't yet figured out
>>>>> what is
>>>>> actually leaking though.
>>>>>
>>>>> The commit is:
>>>>> Author: bligon <bligon>
>>>>> Date:   Fri Nov 5 16:09:27 2010 +0000
>>>>>
>>>>>    Corrects problem where the server tries to cancel a timed-out i/o
>>>>> operation but
>>>>>    before the cancel can happen, the operation finishes; so, the cancel
>>>>> can't
>>>>>    find the op id and subsequently caused a segfault.'
>>>>>
>>>>>
>>>>> diff --git a/src/common/mgmt/pint-context.c
>>>>> b/src/common/mgmt/pint-context.c
>>>>> index 0a07256..cda038c 100644
>>>>> --- a/src/common/mgmt/pint-context.c
>>>>> +++ b/src/common/mgmt/pint-context.c
>>>>> @@ -422,10 +422,17 @@ int PINT_context_test_all(PINT_context_id
>>>>> context_id,
>>>>>
>>>>>     for(; i < *count; ++i)
>>>>>     {
>>>>> +        struct PINT_op_entry *op_entry;
>>>>>         qentry = user_ptrs[i];
>>>>>         ctx_entry = PINT_queue_entry_object(
>>>>>             qentry, struct PINT_context_queue_entry, qentry);
>>>>>         op_ids[i] = ctx_entry->op_id;
>>>>> +        op_entry = id_gen_safe_lookup(ctx_entry->op_id);
>>>>> +        if (op_entry)
>>>>> +        {
>>>>> +           id_gen_safe_unregister(op_entry->op.id);
>>>>> +           free(op_entry);
>>>>> +        }
>>>>>         uptr = ctx_entry->user_ptr;
>>>>>         errors[i] = ctx_entry->result;
>>>>>         user_ptrs[i] = uptr;
>>>>> @@ -519,11 +526,19 @@ int PINT_context_test(PINT_context_id context_id,
>>>>>                                     &op_id, &qentry, microsecs);
>>>>>     if(ret == 0)
>>>>>     {
>>>>> +        struct PINT_op_entry *op_entry;
>>>>>         entry = PINT_queue_entry_object(
>>>>>             qentry,
>>>>>             struct PINT_context_queue_entry, qentry);
>>>>> +        op_entry = id_gen_safe_lookup(entry->op_id);
>>>>> +        if (op_entry)
>>>>> +        {
>>>>> +           id_gen_safe_unregister(op_entry->op.id);
>>>>> +           free(op_entry);
>>>>> +        }
>>>>>         *user_ptr = entry->user_ptr;
>>>>>         *error = entry->result;
>>>>> +
>>>>>         free(entry);
>>>>>     }
>>>>>
>>>>> diff --git a/src/common/mgmt/pint-context.h
>>>>> b/src/common/mgmt/pint-context.h
>>>>> index 78dfb4b..147be5c 100644
>>>>> --- a/src/common/mgmt/pint-context.h
>>>>> +++ b/src/common/mgmt/pint-context.h
>>>>> @@ -9,6 +9,7 @@
>>>>>
>>>>>  #include "pint-op.h"
>>>>>  #include "pint-queue.h"
>>>>> +#include "quickhash.h"
>>>>>
>>>>>  enum PINT_context_type
>>>>>  {
>>>>> @@ -18,6 +19,19 @@ enum PINT_context_type
>>>>>
>>>>>  typedef PVFS_id_gen_t PINT_context_id;
>>>>>
>>>>> +struct PINT_op_entry
>>>>> +{
>>>>> +   void *user_ptr;
>>>>> +   PINT_operation_t op;
>>>>> +   PVFS_id_gen_t wq_id;
>>>>> +   PINT_context_id ctx_id;
>>>>> +   PVFS_error error;
>>>>> +
>>>>> +   struct qhash_head link;
>>>>> +};
>>>>> +
>>>>> +
>>>>> +
>>>>>  typedef int (*PINT_completion_callback)(PINT_context_id ctx_id,
>>>>>                                         int count,
>>>>>                                         PINT_op_id *op_ids,
>>>>> diff --git a/src/common/mgmt/pint-mgmt.c b/src/common/mgmt/pint-mgmt.c
>>>>> index fa52d8d..120a5a1 100644
>>>>> --- a/src/common/mgmt/pint-mgmt.c
>>>>> +++ b/src/common/mgmt/pint-mgmt.c
>>>>> @@ -42,17 +42,6 @@ struct PINT_worker_s
>>>>>     struct qlist_head link;
>>>>>  };
>>>>>
>>>>> -struct PINT_op_entry
>>>>> -{
>>>>> -    void *user_ptr;
>>>>> -    PINT_operation_t op;
>>>>> -    PVFS_id_gen_t wq_id;
>>>>> -    PINT_context_id ctx_id;
>>>>> -    PVFS_error error;
>>>>> -
>>>>> -    struct qhash_head link;
>>>>> -};
>>>>> -
>>>>>  struct PINT_manager_s
>>>>>  {
>>>>>     PINT_context_id context;
>>>>> @@ -634,12 +623,15 @@ int PINT_manager_ctx_post(PINT_manager_t manager,
>>>>>         return ret;
>>>>>     }
>>>>>
>>>>> -    id_gen_fast_register(&op_entry->op.id, op_entry);
>>>>> -
>>>>>     gossip_debug(GOSSIP_MGMT_DEBUG,
>>>>>                  "[MGMT]: manager ops: adding op id: %llu\n",
>>>>>                  llu(op_entry->op.id));
>>>>>     gen_mutex_lock(&manager->mutex);
>>>>> +    ret = id_gen_safe_register(&op_entry->op.id, op_entry);
>>>>> +    if (ret < 0)
>>>>> +    {
>>>>> +       return ret;
>>>>> +    }
>>>>>     qhash_add(manager->ops, &op_entry->op.id, &op_entry->link);
>>>>>     manager->op_count++;
>>>>>     gen_mutex_unlock(&manager->mutex);
>>>>> @@ -657,6 +649,7 @@ int PINT_manager_ctx_post(PINT_manager_t manager,
>>>>>          * operation and return the error
>>>>>          */
>>>>>         gen_mutex_lock(&manager->mutex);
>>>>> +        id_gen_safe_unregister(op_entry->op.id);
>>>>>         qhash_search_and_remove(manager->ops, &op_entry->op.id);
>>>>>         gen_mutex_unlock(&manager->mutex);
>>>>>
>>>>> @@ -1047,7 +1040,7 @@ int PINT_manager_test_op(PINT_manager_t manager,
>>>>>     struct PINT_op_entry *entry;
>>>>>     PINT_context_id context;
>>>>>
>>>>> -    entry = id_gen_fast_lookup(op_id);
>>>>> +    entry = id_gen_safe_lookup(op_id);
>>>>>     if(!entry)
>>>>>     {
>>>>>         return -PVFS_EINVAL;
>>>>> @@ -1143,13 +1136,17 @@ int PINT_manager_cancel(PINT_manager_t manager,
>>>>>                        PINT_queue_id,
>>>>>                        PINT_operation_t *);
>>>>>
>>>>> -    entry = id_gen_fast_lookup(op_id);
>>>>> +    entry = id_gen_safe_lookup(op_id);
>>>>>     if(!entry)
>>>>>     {
>>>>>         return -PVFS_EINVAL;
>>>>>     }
>>>>>     ret = PINT_manager_find_worker(
>>>>>         manager, NULL, NULL, NULL, entry->wq_id, &worker, &queue_id);
>>>>> +    if (ret != 0)
>>>>> +    {
>>>>> +       return ret;
>>>>> +    }
>>>>>     context = entry->ctx_id;
>>>>>     op = &entry->op;
>>>>>
>>>>> @@ -1215,7 +1212,7 @@ int PINT_manager_wait_op(PINT_manager_t manager,
>>>>>     int msecs_remaining;
>>>>>     struct timeval last, now;
>>>>>
>>>>> -    entry = id_gen_fast_lookup(op_id);
>>>>> +    entry = id_gen_safe_lookup(op_id);
>>>>>     if(!entry)
>>>>>     {
>>>>>         return -PVFS_EINVAL;
>>>>> @@ -1451,9 +1448,6 @@ int PINT_manager_complete_op(PINT_manager_t
>>>>> manager,
>>>>>      * since blocking calls never add the op
>>>>>      * to the table
>>>>>      */
>>>>> -
>>>>> -    /* free the op entry */
>>>>> -    free(entry);
>>>>>     return ret;
>>>>>  }
>>>>>
>>>>>
>>>>> --
>>>>> Benjamin Severs
>>>>> _______________________________________________
>>>>> Pvfs2-developers mailing list
>>>>> [email protected]
>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Becky Ligon
>>>> OrangeFS Support and Development
>>>> Omnibond Systems
>>>> Anderson, South Carolina
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Becky Ligon
>>> OrangeFS Support and Development
>>> Omnibond Systems
>>> Anderson, South Carolina
>>>
>>>
>>>
>>
>>
>> --
>> Becky Ligon
>> OrangeFS Support and Development
>> Omnibond Systems
>> Anderson, South Carolina
>>
>>
>>
>> _______________________________________________
>> Pvfs2-developers mailing list
>> [email protected]
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
>>
>


-- 
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to