Yeah. It looks like the lock-free updates to open_fd_count are creating the issue.
There is no double close, as I never hit the assert(open_fd_count > 0) that I
added before the decrements.
Once it gets into this state, it ping-pongs between 0 and ULLONG_MAX.
So as a workaround I have initialized open_fd_count = <num of worker threads>
to avoid these racy decrements. I haven't seen the warnings after this
change over a couple of hours of testing.
[work-162] fsal_open :FSAL :CRIT :before increment open_fd_count0
[work-162] fsal_open :FSAL :CRIT :after increment open_fd_count1
[work-128] fsal_close :FSAL :CRIT :before decrement open_fd_count1
[work-128] fsal_close :FSAL :CRIT :after decrement open_fd_count0
[work-153] fsal_open :FSAL :CRIT :before increment open_fd_count0
[work-153] fsal_open :FSAL :CRIT :after increment open_fd_count1
[work-153] fsal_close :FSAL :CRIT :before decrement open_fd_count1
[work-162] fsal_close :FSAL :CRIT :before decrement open_fd_count1
[work-153] fsal_close :FSAL :CRIT :after decrement open_fd_count0
[work-162] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
[work-148] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit Exceeded. Disabling FD Cache and waking LRU thread. open_fd_count=18446744073709551615, fds_hard_limit=4055
[work-111] fsal_open :FSAL :CRIT :before increment open_fd_count18446744073709551615
[work-111] fsal_open :FSAL :CRIT :after increment open_fd_count0
[cache_lru] lru_run :INODE LRU :EVENT :Re-enabling FD cache.
[work-111] fsal_close :FSAL :CRIT :before decrement open_fd_count0
[work-111] fsal_close :FSAL :CRIT :after decrement open_fd_count18446744073709551615
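
For what it's worth, here is a minimal, self-contained sketch of what I mean by
making the updates safe. This is not the Ganesha code (count_open, count_close
and worker are illustrative names I made up); it just models the open/close
accounting with C11 atomics so that concurrent decrements cannot lose updates or
wrap the counter to ULLONG_MAX the way the trace above shows:

/*
 * Illustrative only -- not the Ganesha source.  Counter updates go
 * through C11 atomics; the decrement uses a CAS loop so it can never
 * run below zero and wrap.  Build with: cc -std=c11 -pthread demo.c
 */
#include <inttypes.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t open_fd_count;  /* stand-in for the real counter */

static void count_open(void)
{
        /* atomic read-modify-write: no lost updates between threads */
        atomic_fetch_add_explicit(&open_fd_count, 1, memory_order_relaxed);
}

static void count_close(void)
{
        uint64_t cur = atomic_load_explicit(&open_fd_count,
                                            memory_order_relaxed);

        /* refuse to decrement past zero instead of wrapping */
        while (cur != 0 &&
               !atomic_compare_exchange_weak_explicit(&open_fd_count, &cur,
                                                      cur - 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
                ;

        if (cur == 0)
                fprintf(stderr, "count_close: counter already 0 (double close?)\n");
}

static void *worker(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++) {
                count_open();
                count_close();
        }
        return NULL;
}

int main(void)
{
        pthread_t t[8];

        for (int i = 0; i < 8; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 8; i++)
                pthread_join(t[i], NULL);

        /* with plain ++/-- this can end up at 18446744073709551615;
         * with the atomic versions it is always 0 */
        printf("open_fd_count = %" PRIu64 "\n", atomic_load(&open_fd_count));
        return 0;
}

With something like this in place, the warning path actually means something:
if count_close() ever sees zero, that really is a double close rather than a
lost update.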
-bharat
On Sun, Feb 11, 2018 at 10:32 PM, Frank Filz <ffilz...@mindspring.com>
wrote:
> Yea, open_fd_count is broken…
>
>
>
> We have been working on the right way to fix it.
>
>
>
> Frank
>
>
>
> From: bharat singh [mailto:bharat064...@gmail.com]
> Sent: Saturday, February 10, 2018 7:42 PM
> To: Malahal Naineni <mala...@gmail.com>
> Cc: nfs-ganesha-devel@lists.sourceforge.net
> Subject: Re: [Nfs-ganesha-devel] Ganesha V2.5.2: mdcache high open_fd_count
>
>
>
> Hey,
>
>
>
> I think there is a leak in open_fd_count.
>
>
>
> fsal_rdwr() uses fsal_open() to open the file, but uses
> obj->obj_ops.close(obj) to close it, and that path never decrements
> open_fd_count.
>
> So the counter keeps increasing, and I could easily hit the 4k hard limit
> with prolonged reads/writes. (A toy model of this pairing follows the code
> snippet below.)
>
>
>
> I changed it to use fsal_close(), which also does the decrement. After this
> change open_fd_count was looking OK.
>
> But recently I saw open_fd_count underflow to
> open_fd_count=18446744073709551615.
>
> So I am suspecting a double close. Any suggestions?
>
>
>
> Code snippet from V2.5-stable/src/FSAL/fsal_helper.c:
>
> fsal_status_t fsal_rdwr(struct fsal_obj_handle *obj,
>                         fsal_io_direction_t io_direction,
>                         uint64_t offset, size_t io_size,
>                         size_t *bytes_moved, void *buffer,
>                         bool *eof,
>                         bool *sync, struct io_info *info)
> {
>         ...
>         loflags = obj->obj_ops.status(obj);
>         while ((!fsal_is_open(obj))
>                || (loflags && loflags != FSAL_O_RDWR && loflags != openflags)) {
>                 loflags = obj->obj_ops.status(obj);
>                 if ((!fsal_is_open(obj))
>                     || (loflags && loflags != FSAL_O_RDWR
>                         && loflags != openflags)) {
>                         fsal_status = fsal_open(obj, openflags);
>                         if (FSAL_IS_ERROR(fsal_status))
>                                 goto out;
>                         opened = true;
>                 }
>                 loflags = obj->obj_ops.status(obj);
>         }
>         ...
>         if ((fsal_status.major != ERR_FSAL_NOT_OPENED)
>             && (obj->obj_ops.status(obj) != FSAL_O_CLOSED)) {
>                 LogFullDebug(COMPONENT_FSAL,
>                              "fsal_rdwr_plus: CLOSING file %p", obj);
>
>                 fsal_status = obj->obj_ops.close(obj);   >>>>>>>> using fsal_close() here?
>                 if (FSAL_IS_ERROR(fsal_status)) {
>                         LogCrit(COMPONENT_FSAL,
>                                 "Error closing file in fsal_rdwr_plus: %s.",
>                                 fsal_err_txt(fsal_status));
>                 }
>         }
>         ...
>         if (opened) {
>                 fsal_status = obj->obj_ops.close(obj);   >>>>>>>> using fsal_close() here?
>                 if (FSAL_IS_ERROR(fsal_status)) {
>                         LogEvent(COMPONENT_FSAL,
>                                  "fsal_rdwr_plus: close = %s",
>                                  fsal_err_txt(fsal_status));
>                         goto out;
>                 }
>         }
>         ...
> }
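>
> To make the mismatch concrete, here is a tiny stand-alone model (my own toy
> code, not the Ganesha API -- fsal_open_like()/fsal_close_like() are made-up
> names, and raw_close() stands in for obj->obj_ops.close(), which does no
> accounting). Only the wrapper/wrapper pairing leaves the counter balanced,
> which is the substitution I am asking about above:
>
> #include <stdint.h>
> #include <stdio.h>
>
> static uint64_t open_fd_count;          /* global accounting */
>
> static void raw_open(void)  { /* FSAL-level open, no accounting */ }
> static void raw_close(void) { /* FSAL-level close, no accounting */ }
>
> static void fsal_open_like(void)  { raw_open();  open_fd_count++; }
> static void fsal_close_like(void) { raw_close(); open_fd_count--; }
>
> int main(void)
> {
>         /* leaky pairing: wrapper open, raw close */
>         for (int i = 0; i < 3; i++) {
>                 fsal_open_like();
>                 raw_close();            /* counter never comes back down */
>         }
>         printf("raw close pairing:     open_fd_count = %llu\n",
>                (unsigned long long)open_fd_count);      /* prints 3 */
>
>         open_fd_count = 0;
>
>         /* balanced pairing: wrapper open, wrapper close */
>         for (int i = 0; i < 3; i++) {
>                 fsal_open_like();
>                 fsal_close_like();
>         }
>         printf("wrapper close pairing: open_fd_count = %llu\n",
>                (unsigned long long)open_fd_count);      /* prints 0 */
>         return 0;
> }
>
> The point is just that whichever call increments the counter needs a matching
> call that decrements it on every close path.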
>
>
>
>
>
> On Tue, Jan 2, 2018 at 12:30 AM, Malahal Naineni <mala...@gmail.com>
> wrote:
>
> The links I gave you have everything you need. You should be able to
> download the Gerrit reviews with "git review -d <number>" or download them
> from the Gerrit web GUI.
>
>
>
> "390496" is merged upstream, but the other one is not merged yet.
>
>
>
> $ git log --oneline --grep='Fix closing global file descriptors'
> origin/next
>
> 5c2efa8f0 Fix closing global file descriptors
>
>
>
>
>
>
>
>
>
>
>
> On Tue, Jan 2, 2018 at 3:22 AM, bharat singh <bharat064...@gmail.com>
> wrote:
>
> Thanks Malahal
>
>
>
> Can you point me to these issues/fixes? I will try to patch V2.5-stable
> and run my tests.
>
>
>
> Thanks,
>
> Bharat
>
>
>
> On Mon, Jan 1, 2018 at 10:20 AM, Malahal Naineni <mala...@gmail.com>
> wrote:
>
> >> I see that mdcache keeps growing beyond the high water mark and lru
> reclamation can’t keep up.
>
>
>
> mdcache is different from the FD cache. I don't think we found an issue with
> mdcache itself, but we did find a couple of issues with the FD cache:
>
>
>
> 1) https://review.gerrithub.io/#/c/391266/
>
> 2) https://review.gerrithub.io/#/c/390496/
>
>
>
> Neither of them is in V2.5-stable at this point. We will have to backport
> these and others soon.
>
>
>
> Regards, Malahal.
>
>
>
> On Mon, Jan 1, 2018 at 11:04 PM, bharat singh <bharat064...@gmail.com>
> wrote:
>
> Adding nfs-ganesha-support..
>
>
>
>
>
> On Fri, Dec 29, 2017 at 11:01 AM, bharat singh <bharat064...@gmail.com>
> wrote:
>
> Hello,
>
>
>
> I am testing the NFSv3 Ganesha implementation against the nfstest_io tool. I
> see that mdcache keeps growing beyond the high water mark and LRU reclamation
> can't keep up.
>
>
>
> [cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
>
> mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread. open_fd_count=14196, lru_state.fds_hiwat=3686, lru_state.fds_lowat=2048, lru_state.fds_hard_limit=4055
>
>
>
> I am on Ganesha V2.5.2 with default config settings.
>
>
>
> So, a couple of questions:
>
> 1. Is Ganesha tested against these kinds of tools, which do a lot of
> opens/closes in quick succession?
>
> 2. Is there a way to suppress these error messages and/or expedite the LRU
> reclamation process?
>
> 3. Any suggestions regarding the use of these kinds of tools with Ganesha?
>
>
>
>
>
> Thanks,
>
> Bharat
>
>
>
>
>
> --
>
> -Bharat
>
> --
>
> -Bharat
>
> --
>
> -Bharat
>
--
-Bharat