Re: [Nfs-ganesha-devel] mdcache growing beyond limits.

Pradeep Wed, 04 Apr 2018 09:38:57 -0700

Hi Daniel,

I tried increasing the number of lanes to 1023. Usage looks better, but
entries_used is still over entries_hiwat:


$2 = {entries_hiwat = 100000, entries_used = 299838, chunks_hiwat = 100000,
chunks_used = 1235, fds_system_imposed = 1048576,
  fds_hard_limit = 1038090, fds_hiwat = 943718, fds_lowat = 524288,
futility = 0, per_lane_work = 50, biggest_window = 419430,
  prev_fd_count = 39434, prev_time = 1522775283, caching_fds = true}
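
For scale, here is some quick arithmetic on this dump and the earlier one quoted further down. It uses only the entries_hiwat and entries_used values shown in the two dumps; the labels and the helper name are made up for illustration and nothing here is Ganesha-specific.

```python
# Values copied from the two lru_state dumps in this thread.
dumps = {
    "earlier dump": {"entries_hiwat": 200000, "entries_used": 1772870},
    "1023-lane dump": {"entries_hiwat": 100000, "entries_used": 299838},
}


def overshoot(d):
    """Return entries_used as a multiple of entries_hiwat."""
    return d["entries_used"] / d["entries_hiwat"]


ratios = {name: overshoot(d) for name, d in dumps.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x entries_hiwat")
```

So the cache went from roughly nine times the limit to roughly three times the limit, which is better but still not bounded.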

I'm simulating a build workload by running the SpecFS SWBUILD workload.
This is with Ganesha 2.7 and FSAL_VFS. The server has 4 CPUs and 12 GB of
memory.

For build 8 (40 processes), the latency increased from 5 ms (with 17 lanes)
to 22 ms (with 1023 lanes), and the test failed to achieve the required IOPS.

Thanks,
Pradeep

On Tue, Apr 3, 2018 at 7:58 AM, Pradeep <pradeeptho...@gmail.com> wrote:

> Hi Daniel,
>
> Sure I will try that.
>
> One thing I tried was to not allocate new entries and instead return
> NFS4ERR_DELAY, in the hope that the increased refcnt at the LRU is
> temporary. This worked for some time, but then I hit a case where
> all the entries at the LRU of L1 had a refcnt of 2 and the
> subsequent entries had a refcnt of 1. All the L2 queues were empty. I
> realized that whenever a new entry is created, its refcnt is 2 and it
> is put at the LRU. Promotions from L2 also move entries to the LRU of
> L1. So it is likely that many threads end up finding no reapable
> entries at the LRU and end up allocating new entries.
>
> Then I tried another experiment: invoke lru_wake_thread() when the
> number of entries is greater than entries_hiwat, but still allocate a
> new entry for the current thread. This worked. I had to change
> lru_run() to allow demotion when 'entries > entries_hiwat', in
> addition to the max FD check. The side effect is that it closes
> FDs and demotes entries to L2. Almost all of these FDs are opened in
> the context of setattr/getattr, so the attributes are already in the
> cache and the FDs are probably useless until the cache expires. I
> think your idea of moving further down the lane may be a better
> approach.
>
> I will try your suggestion next. With 1023 lanes, it is unlikely that
> all lanes will have an active entry.
>
> Thanks,
> Pradeep
>
> On 4/3/18, Daniel Gryniewicz <d...@redhat.com> wrote:
> > So, the way this is supposed to work is that getting a ref when the ref
> > is 1 is always an LRU_REQ_INITIAL ref, so that moves it to the MRU.  At
> > that point, further refs don't move it around in the queue, just
> > increment the refcount.  This should be the case, because
> > mdcache_new_entry() and mdcache_find_keyed() both get an INITIAL ref,
> > and all other refs require you to already have a pointer to the entry
> > (and therefore a ref).
> >
> > Can you try something, since you have a reproducer?  It seems that, with
> > 1.7 million files, 17 lanes may be a bit low.  Can you try with
> > something ridiculously large, like 1023, and see if that makes a
> > difference?
> >
> > I suspect we'll have to add logic to move further down the lanes if
> > futility hits.
> >
> > Daniel
> >
> > On 04/02/2018 12:30 PM, Pradeep wrote:
> >> We discussed this a while ago. I'm running into this again with 2.6.0.
> >> Here is a snapshot of the lru_state (I set the max entries to 10):
> >>
> >> {entries_hiwat = 200000, entries_used = 1772870,
> >>  chunks_hiwat = 100000, chunks_used = 16371,
> >>  lru_reap_l1 = 8116842, lru_reap_l2 = 1637334,
> >>  lru_reap_failed = 1637334,
> >>  attr_from_cache = 31917512, attr_from_cache_for_client = 5975849,
> >>  fds_system_imposed = 1048576, fds_hard_limit = 1038090,
> >>  fds_hiwat = 943718, fds_lowat = 524288,
> >>  futility = 0, per_lane_work = 50, biggest_window = 419430,
> >>  prev_fd_count = 0, prev_time = 1522647830, caching_fds = true}
> >>
> >> As you can see, it has grown well beyond the limit set (1.7 million vs
> >> 200K max size). lru_reap_failed indicates the number of times the reap
> >> failed from L1 and L2.
> >> I'm wondering what can cause the reap to fail once it reaches a steady
> >> state. It appears to me that the entry at LRU (head of the queue) is
> >> actually being used (refcnt > 1) and there are entries in the queue with
> >> refcnt == 1. But those are not being looked at. My understanding is that
> >> if an entry is accessed, it must move to MRU (tail of the queue). Any
> >> idea why the entry at LRU can have a refcnt > 1?
> >>
> >> This can happen if the refcnt is incremented without holding QLOCK:
> >> if lru_reap_impl() is called at the same time from another thread, it
> >> will skip the first entry and return NULL. That unlocked increment was
> >> done in _mdcache_lru_ref(), which could cause the refcnt on the head of
> >> the queue to be incremented while some other thread looks at it holding
> >> QLOCK. I tried moving the increment/dequeue in _mdcache_lru_ref() inside
> >> QLOCK, but that did not help.
> >>
> >> Also if "get_ref()" is called for the entry at the LRU for some reason,
> >> it will just increment refcnt and return. I think the assumption is that
> >> by the time "get_ref()" is called, the entry is supposed to be out of
> >> the LRU.
> >>
> >>
> >> Thanks,
> >> Pradeep
> >>
> >
>
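
To make the reap behaviour discussed in the quoted messages concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Ganesha code: the names Entry, reap_head_only, reap_scan and per_lane_limit are hypothetical, and locking, L2 and the real queue structures are left out. It assumes, as described in the thread, that an entry with refcnt == 1 is reapable, that refcnt > 1 means it is in use, and that a reap attempt which finds a busy entry at the LRU end of a lane gives up on that lane.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    refcnt: int  # 1 = idle (reapable), >1 = currently in use


def reap_head_only(lanes):
    """Check only the LRU end (index 0) of each lane."""
    for lane in lanes:
        if lane and lane[0].refcnt == 1:
            return lane.pop(0)
    return None  # failed reap: the caller would allocate a new entry


def reap_scan(lanes, per_lane_limit):
    """Check up to per_lane_limit entries from the LRU end of each lane."""
    for lane in lanes:
        for i in range(min(per_lane_limit, len(lane))):
            if lane[i].refcnt == 1:
                return lane.pop(i)
    return None


def make_lanes(n_lanes):
    """Each lane has a busy entry at the LRU end, followed by idle ones."""
    return [
        [Entry(f"busy{n}", 2), Entry(f"idle{n}a", 1), Entry(f"idle{n}b", 1)]
        for n in range(n_lanes)
    ]


lanes = make_lanes(3)
head_only_result = reap_head_only(lanes)          # every lane head is busy
scan_result = reap_scan(lanes, per_lane_limit=2)  # looks one entry further

print("head-only reap:", head_only_result)
print("scanning reap:", scan_result.name if scan_result else None)
```

In this model the head-only reap fails even though two thirds of the entries are idle, which matches the symptom of lru_reap_failed climbing while the cache grows. Scanning a bounded distance down each lane finds an idle entry, at the cost of more work per reap attempt.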

_______________________________________________
Nfs-ganesha-devel mailing list
Nfs-ganesha-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
