Thanks Malahal,
I patched these changes onto our FSAL to decrement open_fd_count in
close() and increment it in open(), but it's not working:
the 4k fd limit still got exhausted in a two-day IO run.
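(For reference, the patch is essentially of this shape: a minimal
sketch using C11 atomics and hypothetical hook names, not our actual
FSAL code.)

#include <stdatomic.h>

/* Mirrors Ganesha's global open_fd_count. */
static atomic_size_t open_fd_count;

/* Hypothetical hook, called from our FSAL's open path after the fd
 * has been created. */
static void fsal_fd_opened(void)
{
        atomic_fetch_add(&open_fd_count, 1);
}

/* Hypothetical hook, called from our FSAL's close path after the fd
 * has been released. */
static void fsal_fd_closed(void)
{
        atomic_fetch_sub(&open_fd_count, 1);
}
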
I suspect there is a missing close() for an open() somewhere in the Ganesha
code. Has anyone in the community seen this leak in any of the FSALs?
To narrow down the leak, I am rerunning the tests with the fd cache
disabled. Any suggestions on this?
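(In case anyone wants to reproduce this, here is roughly the CACHEINODE
tuning I am using to keep the LRU reaper aggressive while I debug. The
parameter names are recalled from the V2.5 docs, so please double-check
them against the shipped config samples:)

CACHEINODE {
        # Run the LRU reaper thread more often than the default
        LRU_Run_Interval = 30;
        # Reclaim cached FDs earlier (percentages of the fd rlimit)
        FD_LWMark_Percent = 20;
        FD_HWMark_Percent = 50;
        FD_Limit_Percent = 90;
}
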
-bharat
On Fri, Feb 16, 2018 at 11:50 AM, Malahal Naineni <mala...@gmail.com> wrote:
> See https://review.gerrithub.io/#/c/391267/ for the GPFS FSAL. You could do
> something similar for the VFS FSAL if that is the one you are using.
>
> Regards, Malahal.
>
> On Thu, Feb 15, 2018 at 1:19 AM, bharat singh <bharat064...@gmail.com>
> wrote:
>
>> Yeah, that worked, and I don't see this going below -1. So initializing it
>> to a non-zero value has avoided this for now.
>>
>> But I still see the 4k fd limit being exhausted after 24 hours of IO. My
>> setup currently shows open_fd_count=13k, but there are only 30 files open:
>> # ls -al /proc/25832/fd | wc -l
>> 559
>>
>> Also, /proc doesn't give any clue, so I still believe there are more leaks
>> in this counter than the one I saw in fsal_rdwr().
>> Regarding the proper fix, when would it be available for us to try out?
>>
>>
>> On Mon, Feb 12, 2018 at 10:10 AM, Malahal Naineni <mala...@gmail.com>
>> wrote:
>>
>>> Technically you should use an atomic fetch to read it, at least on some
>>> archs. Also, your assertion might not be hit even if the atomic ops are
>>> working right (in fact, they had better be working correctly).
>>>
>>> As an example, say the count is 1 and both threads pass the assertion
>>> check. Then both threads decrement, and the end value would be -1. If you
>>> want to catch this in an assert, use the return value of the atomic
>>> decrement operation for the assertion.
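>>>
>>> Something like this (a minimal sketch with C11 atomics rather than
>>> Ganesha's abstract_atomic wrappers, but the idea is the same):
>>>
>>> #include <assert.h>
>>> #include <stdatomic.h>
>>>
>>> static atomic_size_t open_fd_count;
>>>
>>> static void fd_count_dec(void)
>>> {
>>>         /* atomic_fetch_sub returns the value BEFORE the decrement,
>>>          * so this catches underflow without a second, racy read. */
>>>         size_t prev = atomic_fetch_sub(&open_fd_count, 1);
>>>         assert(prev > 0);
>>> }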
>>>
>>>
>>>
>>> On Mon, Feb 12, 2018 at 9:55 PM bharat singh <bharat064...@gmail.com>
>>> wrote:
>>>
>>>> Yeah. It looks like lock-free updates to open_fd_count are creating the
>>>> issue.
>>>> There is no double close, as I couldn't hit the assert(open_fd_count >
>>>> 0) that I added before the decrements.
>>>>
>>>> And once it hits this state, it ping-pongs between 0 & ULLONG_MAX.
>>>>
>>>> So as a workaround I have initialized open_fd_count = <num of worker
>>>> threads> to avoid these racy decrements. I haven't seen the warnings after
>>>> this change over a couple of hours of testing.
>>>>
>>>>
>>>>
>>>> [work-162] fsal_open :FSAL :CRIT :before increment open_fd_count0
>>>> [work-162] fsal_open :FSAL :CRIT :after increment open_fd_count1
>>>> [work-128] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>>>> [work-128] fsal_close :FSAL :CRIT :after decrement open_fd_count0
>>>> [work-153] fsal_open :FSAL :CRIT :before increment open_fd_count0
>>>> [work-153] fsal_open :FSAL :CRIT :after increment open_fd_count1
>>>> [work-153] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>>>> [work-162] fsal_close :FSAL :CRIT :before decrement open_fd_count1
>>>> [work-153] fsal_close :FSAL :CRIT :after decrement open_fd_count0
>>>> [work-162] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
>>>> [work-148] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit Exceeded. Disabling FD Cache and waking LRU thread. open_fd_count=18446744073709551615, fds_hard_limit=4055
>>>>
>>>> [work-111] fsal_open :FSAL :CRIT :before increment open_fd_count18446744073709551615
>>>> [work-111] fsal_open :FSAL :CRIT :after increment open_fd_count0
>>>> [cache_lru] lru_run :INODE LRU :EVENT :Re-enabling FD cache.
>>>> [work-111] fsal_close :FSAL :CRIT :before decrement open_fd_count0
>>>> [work-111] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
>>>>
>>>> -bharat
>>>>
>>>> On Sun, Feb 11, 2018 at 10:32 PM, Frank Filz <ffilz...@mindspring.com>
>>>> wrote:
>>>>
>>>>> Yea, open_fd_count is broken…
>>>>>
>>>>>
>>>>>
>>>>> We have been working on the right way to fix it.
>>>>>
>>>>>
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>>
>>>>> From: bharat singh [mailto:bharat064...@gmail.com]
>>>>> Sent: Saturday, February 10, 2018 7:42 PM
>>>>> To: Malahal Naineni <mala...@gmail.com>
>>>>> Cc: nfs-ganesha-devel@lists.sourceforge.net
>>>>> Subject: Re: [Nfs-ganesha-devel] Ganesha V2.5.2: mdcache high open_fd_count
>>>>>
>>>>>
>>>>>
>>>>> Hey,
>>>>>
>>>>>
>>>>>
>>>>> I think there is a leak in open_fd_count.
>>>>>
>>>>>
>>>>>
>>>>> fsal_rdwr() uses fsal_open() to open the file, but uses
>>>>> obj->obj_ops.close(obj) to close it, so open_fd_count is never
>>>>> decremented.
>>>>>
>>>>> The counter therefore keeps increasing, and I could easily hit the 4k
>>>>> hard limit with prolonged reads/writes.
>>>>>
>>>>>
>>>>>
>>>>> I changed it to use fsal_close(), which also does the decrement. After
>>>>> this change open_fd_count was looking OK.
>>>>>
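>>>>> (For context, fsal_close() is roughly of this shape; this is
>>>>> paraphrased from memory of V2.5 fsal_helper.c, so the real code may
>>>>> differ, but the point is that the decrement is tied to a successful
>>>>> close:)
>>>>>
>>>>> fsal_status_t fsal_close(struct fsal_obj_handle *obj)
>>>>> {
>>>>>         fsal_status_t st = obj->obj_ops.close(obj);
>>>>>
>>>>>         /* Only drop the counter when an fd was actually closed */
>>>>>         if (!FSAL_IS_ERROR(st))
>>>>>                 atomic_dec_size_t(&open_fd_count);
>>>>>
>>>>>         return st;
>>>>> }
>>>>>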
>>>>> But recently I saw open_fd_count underflow:
>>>>> open_fd_count=18446744073709551615
>>>>>
>>>>>
>>>>>
>>>>> So I am suspecting a double close. Any suggestions?
>>>>>
>>>>>
>>>>>
>>>>> Code snippet from V2.5-stable/src/FSAL/fsal_helper.c:
>>>>>
>>>>> fsal_status_t fsal_rdwr(struct fsal_obj_handle *obj,
>>>>>                         fsal_io_direction_t io_direction,
>>>>>                         uint64_t offset, size_t io_size,
>>>>>                         size_t *bytes_moved, void *buffer,
>>>>>                         bool *eof,
>>>>>                         bool *sync, struct io_info *info)
>>>>> {
>>>>>         ...
>>>>>         loflags = obj->obj_ops.status(obj);
>>>>>         while ((!fsal_is_open(obj))
>>>>>                || (loflags && loflags != FSAL_O_RDWR
>>>>>                    && loflags != openflags)) {
>>>>>                 loflags = obj->obj_ops.status(obj);
>>>>>                 if ((!fsal_is_open(obj))
>>>>>                     || (loflags && loflags != FSAL_O_RDWR
>>>>>                         && loflags != openflags)) {
>>>>>                         fsal_status = fsal_open(obj, openflags);
>>>>>                         if (FSAL_IS_ERROR(fsal_status))
>>>>>                                 goto out;
>>>>>                         opened = true;
>>>>>                 }
>>>>>                 loflags = obj->obj_ops.status(obj);
>>>>>         }
>>>>>         ...
>>>>>         if ((fsal_status.major != ERR_FSAL_NOT_OPENED)
>>>>>             && (obj->obj_ops.status(obj) != FSAL_O_CLOSED)) {
>>>>>                 LogFullDebug(COMPONENT_FSAL,
>>>>>                              "fsal_rdwr_plus: CLOSING file %p",
>>>>>                              obj);
>>>>>
>>>>>                 /* >>>>>>>> use fsal_close() here? */
>>>>>                 fsal_status = obj->obj_ops.close(obj);
>>>>>                 if (FSAL_IS_ERROR(fsal_status)) {
>>>>>                         LogCrit(COMPONENT_FSAL,
>>>>>                                 "Error closing file in fsal_rdwr_plus: %s.",
>>>>>                                 fsal_err_txt(fsal_status));
>>>>>                 }
>>>>>         }
>>>>>         ...
>>>>>         if (opened) {
>>>>>                 /* >>>>>>>> use fsal_close() here? */
>>>>>                 fsal_status = obj->obj_ops.close(obj);
>>>>>                 if (FSAL_IS_ERROR(fsal_status)) {
>>>>>                         LogEvent(COMPONENT_FSAL,
>>>>>                                  "fsal_rdwr_plus: close = %s",
>>>>>                                  fsal_err_txt(fsal_status));
>>>>>                         goto out;
>>>>>                 }
>>>>>         }
>>>>>         ...
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 2, 2018 at 12:30 AM, Malahal Naineni <mala...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> The links I gave you have everything you need. You should be able to
>>>>> download Gerrit reviews with "git review -d <number>" or from the
>>>>> Gerrit web GUI.
>>>>>
>>>>>
>>>>>
>>>>> "390496" is merged upstream, but the other one is not merged yet.
>>>>>
>>>>>
>>>>>
>>>>> $ git log --oneline --grep='Fix closing global file descriptors' origin/next
>>>>> 5c2efa8f0 Fix closing global file descriptors
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 2, 2018 at 3:22 AM, bharat singh <bharat064...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thanks Malahal
>>>>>
>>>>>
>>>>>
>>>>> Can you point me to these issues/fixes? I will try to patch
>>>>> V2.5-stable and run my tests.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Bharat
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 1, 2018 at 10:20 AM, Malahal Naineni <mala...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> >> I see that mdcache keeps growing beyond the high water mark and lru
>>>>> reclamation can’t keep up.
>>>>>
>>>>>
>>>>>
>>>>> mdcache is different from the "FD" cache. I don't think we found an
>>>>> issue with mdcache itself, but we found a couple of issues with the FD
>>>>> cache:
>>>>>
>>>>>
>>>>> 1) https://review.gerrithub.io/#/c/391266/
>>>>>
>>>>> 2) https://review.gerrithub.io/#/c/390496/
>>>>>
>>>>>
>>>>>
>>>>> Neither of them is in V2.5-stable at this point. We will have to
>>>>> backport these and others soon.
>>>>>
>>>>>
>>>>>
>>>>> Regards, Malahal.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 1, 2018 at 11:04 PM, bharat singh <bharat064...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Adding nfs-ganesha-support...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 29, 2017 at 11:01 AM, bharat singh <bharat064...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>>
>>>>> I am testing the NFSv3 Ganesha implementation against the nfstest_io
>>>>> tool. I see that mdcache keeps growing beyond the high water mark, and
>>>>> LRU reclamation can't keep up.
>>>>>
>>>>>
>>>>>
>>>>> [cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
>>>>>
>>>>> mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread. open_fd_count=14196, lru_state.fds_hiwat=3686, lru_state.fds_lowat=2048, lru_state.fds_hard_limit=4055
>>>>>
>>>>>
>>>>>
>>>>> I am on Ganesha V2.5.2 with default config settings.
>>>>>
>>>>>
>>>>>
>>>>> So a couple of questions:
>>>>>
>>>>> 1. Is Ganesha tested against these kinds of tools, which do a bunch of
>>>>> open/close calls in quick succession?
>>>>>
>>>>> 2. Is there a way to suppress these error messages and/or expedite the
>>>>> LRU reclamation process?
>>>>>
>>>>> 3. Any suggestions regarding the usage of these kinds of tools with
>>>>> Ganesha?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Bharat
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> -Bharat
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> -Bharat
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> -Bharat
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -Bharat
>>>>
>>>>
>>>>
>>
>>
>> --
>> -Bharat
>>
>>
>>
>
--
-Bharat