Yeah, that worked, and I don't see this going below -1. So initializing it
to a non-zero value has avoided this for now.

But I still see the 4k fd limit being exhausted after 24 hours of I/O. My
setup currently shows open_fd_count=13k, but there are only 30 files:

# ls -al /proc/25832/fd | wc -l
559

Also, /proc doesn't give any clue, so I still believe there are more leaks
in this counter than the one I saw in fsal_rdwr().

Regarding the proper fix, when would it be available for us to try out?


On Mon, Feb 12, 2018 at 10:10 AM, Malahal Naineni <mala...@gmail.com> wrote:

> Technically you should use an atomic fetch to read it, at least on some
> archs. Also, your assertion might not be hit even if the atomic ops are
> working right. In fact, they had better be working correctly.
>
> As an example, say the counter is 1 and both threads check the assertion;
> both see a value greater than 0, then both decrement, and the end value is
> -1. If you want to catch this in an assert, please use the return value of
> the atomic decrement operation for the assertion.
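
A minimal sketch of that suggestion, assuming the GCC __atomic builtins
(which I believe the wrappers in abstract_atomic.h map onto); illustrative
only, not the exact Ganesha code:

    #include <assert.h>
    #include <stdint.h>

    static uint64_t open_fd_count;

    static inline void fd_count_inc(void)
    {
            (void) __atomic_add_fetch(&open_fd_count, 1, __ATOMIC_SEQ_CST);
    }

    static inline void fd_count_dec(void)
    {
            /* Capture the post-decrement value atomically. Checking the
             * counter first and decrementing afterwards is racy: two
             * threads can both see 1, both decrement, and the counter
             * wraps to ULLONG_MAX. */
            uint64_t newval = __atomic_sub_fetch(&open_fd_count, 1,
                                                 __ATOMIC_SEQ_CST);

            /* newval wrapped around iff the counter was already 0. */
            assert(newval != UINT64_MAX);
    }

With the return value checked like this, the assert fires exactly on the
decrement that underflows instead of racing with other threads.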
>
>
>
> On Mon, Feb 12, 2018 at 9:55 PM bharat singh <bharat064...@gmail.com>
> wrote:
>
>> Yeah. Looks like lock-free updates to open_fd_count are creating the
>> issue. There is no double close, as I couldn't hit the
>> assert(open_fd_count > 0) I had added before the decrements.
>>
>> And once it hits this state, it ping-pongs between 0 & ULLONG_MAX.
>>
>> So as a workaround I have initialized open_fd_count = <num of worker
>> thds> to avoid these racy decrements. I haven't seen the warnings after
>> this change over a couple of hours of testing.
>>
>>
>>
>> [work-162] fsal_open :FSAL :CRIT :before increment open_fd_count0
>> [work-162] fsal_open :FSAL :CRIT :after increment open_fd_count1
>> [work-128] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>> [work-128] fsal_close :FSAL :CRIT :after decrement open_fd_count0
>> [work-153] fsal_open :FSAL :CRIT :before increment open_fd_count0
>> [work-153] fsal_open :FSAL :CRIT :after increment open_fd_count1
>> [work-153] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>> [work-162] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>> [work-153] fsal_close :FSAL :CRIT :after decrement open_fd_count0
>> [work-162] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
>> [work-148] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit Exceeded.  Disabling FD Cache and waking LRU thread. open_fd_count=18446744073709551615, fds_hard_limit=4055
>>
>> [work-111] fsal_open :FSAL :CRIT :before increment open_fd_count18446744073709551615
>> [work-111] fsal_open :FSAL :CRIT :after increment open_fd_count0
>> [cache_lru] lru_run :INODE LRU :EVENT :Re-enabling FD cache.
>> [work-111] fsal_close :FSAL :CRIT :before decrement open_fd_count0
>> [work-111] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
>>
>> -bharat
>>
>> On Sun, Feb 11, 2018 at 10:32 PM, Frank Filz <ffilz...@mindspring.com>
>> wrote:
>>
>>> Yea, open_fd_count is broken…
>>>
>>>
>>>
>>> We have been working on the right way to fix it.
>>>
>>>
>>>
>>> Frank
>>>
>>>
>>>
>>> *From:* bharat singh [mailto:bharat064...@gmail.com]
>>> *Sent:* Saturday, February 10, 2018 7:42 PM
>>> *To:* Malahal Naineni <mala...@gmail.com>
>>> *Cc:* nfs-ganesha-devel@lists.sourceforge.net
>>> *Subject:* Re: [Nfs-ganesha-devel] Ganesha V2.5.2: mdcache high
>>> open_fd_count
>>>
>>>
>>>
>>> Hey,
>>>
>>>
>>>
>>> I think there is a leak in open_fd_count.
>>>
>>>
>>>
>>> fsal_rdwr() uses fsal_open() to open the file, but uses
>>> obj->obj_ops.close(obj) to close it, and that path never decrements
>>> open_fd_count.
>>>
>>> So the counter keeps increasing, and I could easily hit the 4k hard
>>> limit with prolonged reads/writes.
>>>
>>>
>>>
>>> I changed it to use fsal_close(), as that also does the decrement. After
>>> this change, open_fd_count looked OK.
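
For clarity, the asymmetry I mean is roughly the following (a simplified
sketch, not the exact V2.5 source; the atomic helper names are as I recall
them from abstract_atomic.h):

    static size_t open_fd_count;   /* the counter the mdcache LRU checks */

    fsal_status_t fsal_open(struct fsal_obj_handle *obj,
                            fsal_openflags_t openflags)
    {
            fsal_status_t st = obj->obj_ops.open(obj, openflags);

            if (!FSAL_IS_ERROR(st))
                    (void) atomic_inc_size_t(&open_fd_count); /* goes up */
            return st;
    }

    fsal_status_t fsal_close(struct fsal_obj_handle *obj)
    {
            fsal_status_t st = obj->obj_ops.close(obj);

            if (!FSAL_IS_ERROR(st))
                    (void) atomic_dec_size_t(&open_fd_count); /* goes down */
            return st;
    }

fsal_rdwr() opens through fsal_open(), which increments the counter, but
closes through obj->obj_ops.close() directly, so the matching decrement
never happens and open_fd_count only grows.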
>>>
>>> But recently I saw open_fd_count underflow to 18446744073709551615.
>>>
>>>
>>>
>>> So I am suspecting a double close. Any suggestions?
>>>
>>>
>>>
>>> Code snippet from V2.5-stable/src/FSAL/fsal_helper.c:
>>>
>>> fsal_status_t fsal_rdwr(struct fsal_obj_handle *obj,
>>>                         fsal_io_direction_t io_direction,
>>>                         uint64_t offset, size_t io_size,
>>>                         size_t *bytes_moved, void *buffer,
>>>                         bool *eof,
>>>                         bool *sync, struct io_info *info)
>>> {
>>> ...
>>>         loflags = obj->obj_ops.status(obj);
>>>         while ((!fsal_is_open(obj))
>>>                || (loflags && loflags != FSAL_O_RDWR
>>>                    && loflags != openflags)) {
>>>                 loflags = obj->obj_ops.status(obj);
>>>                 if ((!fsal_is_open(obj))
>>>                     || (loflags && loflags != FSAL_O_RDWR
>>>                         && loflags != openflags)) {
>>>                         fsal_status = fsal_open(obj, openflags);
>>>                         if (FSAL_IS_ERROR(fsal_status))
>>>                                 goto out;
>>>                         opened = true;
>>>                 }
>>>                 loflags = obj->obj_ops.status(obj);
>>>         }
>>> ...
>>>         if ((fsal_status.major != ERR_FSAL_NOT_OPENED)
>>>             && (obj->obj_ops.status(obj) != FSAL_O_CLOSED)) {
>>>                 LogFullDebug(COMPONENT_FSAL,
>>>                              "fsal_rdwr_plus: CLOSING file %p", obj);
>>>
>>>                 fsal_status = obj->obj_ops.close(obj);  /* >>>>>>>> using fsal_close here? */
>>>                 if (FSAL_IS_ERROR(fsal_status)) {
>>>                         LogCrit(COMPONENT_FSAL,
>>>                                 "Error closing file in fsal_rdwr_plus: %s.",
>>>                                 fsal_err_txt(fsal_status));
>>>                 }
>>>         }
>>> ...
>>>         if (opened) {
>>>                 fsal_status = obj->obj_ops.close(obj);  /* >>>>>>>> using fsal_close here? */
>>>                 if (FSAL_IS_ERROR(fsal_status)) {
>>>                         LogEvent(COMPONENT_FSAL,
>>>                                  "fsal_rdwr_plus: close = %s",
>>>                                  fsal_err_txt(fsal_status));
>>>                         goto out;
>>>                 }
>>>         }
>>> ...
>>> }
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 2, 2018 at 12:30 AM, Malahal Naineni <mala...@gmail.com>
>>> wrote:
>>>
>>> The links I gave you will have everything you need. You should be able
>>> to download gerrit reviews by "git review -d <number>" or download from the
>>> gerrit web gui.
>>>
>>>
>>>
>>> "390496" is merged upstream, but the other one is not merged yet.
>>>
>>>
>>>
>>> $ git log --oneline --grep='Fix closing global file descriptors'
>>> origin/next
>>>
>>> 5c2efa8f0 Fix closing global file descriptors
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 2, 2018 at 3:22 AM, bharat singh <bharat064...@gmail.com>
>>> wrote:
>>>
>>> Thanks Malahal
>>>
>>>
>>>
>>> Can you point me to these issues/fixes. I will try to patch V2.5-stable
>>> and run my tests.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Bharat
>>>
>>>
>>>
>>> On Mon, Jan 1, 2018 at 10:20 AM, Malahal Naineni <mala...@gmail.com>
>>> wrote:
>>>
>>> >> I see that mdcache keeps growing beyond the high water mark and lru
>>> reclamation can’t keep up.
>>>
>>>
>>>
>>> mdcache is different from the "FD" cache. I don't think we found an issue
>>> with mdcache itself. We found a couple of issues with the FD cache:
>>>
>>>
>>>
>>> 1) https://review.gerrithub.io/#/c/391266/
>>>
>>> 2) https://review.gerrithub.io/#/c/390496/
>>>
>>>
>>>
>>> Neither of them is in V2.5-stable at this point. We will have to
>>> backport these and others soon.
>>>
>>>
>>>
>>> Regards, Malahal.
>>>
>>>
>>>
>>> On Mon, Jan 1, 2018 at 11:04 PM, bharat singh <bharat064...@gmail.com>
>>> wrote:
>>>
>>> Adding nfs-ganesha-support..
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Dec 29, 2017 at 11:01 AM, bharat singh <bharat064...@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>
>>>
>>>
>>> I am testing the NFSv3 Ganesha implementation against the nfstest_io
>>> tool. I see that mdcache keeps growing beyond the high water mark and
>>> LRU reclamation can't keep up.
>>>
>>>
>>>
>>> [cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded.  The LRU
>>> thread is unable to make progress in reclaiming FDs.  Disabling FD cache.
>>>
>>> mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark,
>>> waking LRU thread. open_fd_count=14196, lru_state.fds_hiwat=3686,
>>> lru_state.fds_lowat=2048, lru_state.fds_hard_limit=4055
>>>
>>>
>>>
>>> I am on Ganesha V2.5.2 with default config settings.
>>>
>>>
>>>
>>> So, a couple of questions:
>>>
>>> 1. Is Ganesha tested against these kinds of tools, which do a bunch of
>>> opens/closes in quick succession?
>>>
>>> 2. Is there a way to suppress these error messages and/or expedite the
>>> LRU reclamation process?
>>>
>>> 3. Any suggestions regarding the use of these kinds of tools with
>>> Ganesha?
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Bharat
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -Bharat
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -Bharat
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -Bharat
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> -Bharat
>>
>>
>>


-- 
-Bharat
