We discussed this a while ago. I'm running into this again with 2.6.0. Here
is a snapshot of the lru_state (I set the maximum number of entries to 200K):

{entries_hiwat = 200000, entries_used = 1772870, chunks_hiwat = 100000,
  chunks_used = 16371, lru_reap_l1 = 8116842, lru_reap_l2 = 1637334,
  lru_reap_failed = 1637334, attr_from_cache = 31917512,
  attr_from_cache_for_client = 5975849, fds_system_imposed = 1048576,
  fds_hard_limit = 1038090, fds_hiwat = 943718, fds_lowat = 524288,
  futility = 0, per_lane_work = 50, biggest_window = 419430,
  prev_fd_count = 0, prev_time = 1522647830, caching_fds = true}

As you can see, it has grown well beyond the configured limit (1.7 million
entries vs. the 200K maximum). lru_reap_failed is the number of times a
reap attempt failed on both L1 and L2.
I'm wondering what can cause the reap to fail once the cache reaches a
steady state. It appears that the entry at the LRU position (head of the
queue) is actually in use (refcnt > 1), while there are entries further
down the queue with refcnt == 1 that are never looked at. My understanding
is that when an entry is accessed, it should move to the MRU position
(tail of the queue). Any idea why the entry at the LRU position can have a
refcnt > 1?

This can happen if the refcnt is incremented without holding QLOCK: if
lru_reap_impl() runs at the same time from another thread, it will see a
refcnt > 1 on the first entry, skip it, and return NULL. The unlocked
increment is done in _mdcache_lru_ref(), so the refcnt on the head of the
queue can be bumped while another thread is examining it under QLOCK. I
tried moving the increment/dequeue in _mdcache_lru_ref() inside QLOCK, but
that did not help.
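
To make the suspected interaction concrete, here is a minimal toy model of
the race as I understand it. This is not the actual mdcache_lru.c code;
toy_lane/toy_entry and the function names are made up, with the lane mutex
standing in for QLOCK:

/* Toy model of the suspected race; NOT the real mdcache code.  The
 * reaper, like lru_reap_impl(), only inspects the entry at the head of
 * a lane, under the lane lock (QLOCK in the real code). */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_entry {
	struct toy_entry *next;
	atomic_int refcnt;      /* 1 == only the queue's sentinel ref */
};

struct toy_lane {
	pthread_mutex_t qlock;  /* stands in for QLOCK */
	struct toy_entry *head; /* LRU position */
};

/* Try to reap from one lane: only the head is ever considered. */
struct toy_entry *toy_reap(struct toy_lane *lane)
{
	struct toy_entry *victim = NULL;

	pthread_mutex_lock(&lane->qlock);
	if (lane->head != NULL && atomic_load(&lane->head->refcnt) == 1) {
		victim = lane->head;
		lane->head = victim->next;      /* dequeue for reuse */
	}
	/* Otherwise the head looks "in use" and the whole lane is skipped,
	 * even if entries further down have refcnt == 1.  If every lane is
	 * skipped, the caller returns NULL (counted as lru_reap_failed). */
	pthread_mutex_unlock(&lane->qlock);
	return victim;
}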

Also, if get_ref() is called for the entry at the LRU position for some
reason, it will just increment the refcnt and return. I think the
assumption is that by the time get_ref() is called, the entry is supposed
to already be off the LRU queue.
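
Continuing the toy model above (again just a sketch, not the real
_mdcache_lru_ref()/get_ref() code), the ref side is an unlocked increment
that never touches the queue, which is how the head can look busy to the
reaper:

/* Ref side of the toy model: an atomic bump, no lane lock, no queue
 * movement.  If this runs on the entry currently at the head of a lane,
 * a concurrent toy_reap() sees refcnt > 1 and skips the whole lane. */
void toy_ref(struct toy_entry *entry)
{
	atomic_fetch_add(&entry->refcnt, 1);
}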


Thanks,
Pradeep


On Mon, Aug 14, 2017 at 5:38 PM, Pradeep <pradeep.tho...@gmail.com> wrote:

>
>
> On Fri, Aug 11, 2017 at 8:52 AM, Daniel Gryniewicz <d...@redhat.com>
> wrote:
>
>> Right, this is reaping.  I was thinking it was the lane thread.  Reaping
>> only looks at the single entry at the LRU of each queue.  We should
>> probably look at some small number of entries in each lane, like 2 or 3.
>>
>
> This is the lane thread, right? The background thread (lane thread?) moves
> entries from L1 to L2 depending on the refcnt. Once an entry has been moved
> to L2, it can be reaped by lru_reap_impl().
>
> A couple of experiments I tried that helped keep the number of cached
> inodes close to entries_hiwat:
> 1. Added a check in lru_run() to invoke lru_run_lane() if the number of
> cached entries is above entries_hiwat (see the sketch after this list).
> 2. Removed the limit on per_lane_work.
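>
> Roughly, the change for (1) looked like this (a sketch from memory, not the
> exact diff; fds_over_lowat is just a stand-in name for the existing FD
> check in lru_run()):
>
>     /* Sketch of experiment (1), not the actual lru_run() body: walk the
>      * lanes when the cached-entry count is over the high watermark, not
>      * only when the open-FD count is over the FD low watermark. */
>     void lru_run_pass_sketch(bool fds_over_lowat)
>     {
>             uint32_t lane;
>             bool entries_over_hiwat =
>                     lru_state.entries_used > lru_state.entries_hiwat;
>
>             if (!fds_over_lowat && !entries_over_hiwat)
>                     return;         /* nothing to demote this pass */
>
>             for (lane = 0; lane < LRU_N_Q_LANES; lane++)
>                     lru_run_lane(lane);     /* demote L1 -> L2 (simplified call) */
>     }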
>
> There were some comments on limiting promotions (from L2 to L1 or within
> L1). Any suggestions on specific things to try out?
>
> Thanks,
> Pradeep
>
>
>>
>> Frank, this, in combination with the PIN lane, is probably the issue.
>>
>> Daniel
>>
>> On 08/11/2017 11:21 AM, Pradeep wrote:
>>
>>> Hi Daniel,
>>>
>>> I'm testing with 2.5.1. I haven't changed those parameters. Those
>>> parameters only take effect once you are in lru_run_lane(), right? Since the
>>> FD count is below the low watermark, it never calls lru_run_lane().
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> On Fri, Aug 11, 2017 at 5:43 AM, Daniel Gryniewicz <d...@redhat.com> wrote:
>>>
>>>     Have you set Reaper_Work?  Have you changed LRU_N_Q_LANES?  (and
>>>     which version of Ganesha?)
>>>
>>>     Daniel
>>>
>>>     On 08/10/2017 07:12 PM, Pradeep wrote:
>>>
>>>         Debugged this a little more. It appears that the entries that
>>>         can be reaped are not at the LRU position (head) of the L1
>>>         queue, so they can only be freed later by lru_run(). I don't see
>>>         that happening either, for some reason.
>>>
>>>         (gdb) p LRU[1].L1
>>>         $29 = {q = {next = 0x7fb459e71960, prev = 0x7fb3ec3c0d30}, id =
>>>         LRU_ENTRY_L1, size = 260379}
>>>
>>>         The head of the list is an entry with refcnt 2, but there are
>>>         several entries further down with refcnt 1.
>>>
>>>         (gdb) p *(mdcache_lru_t *)0x7fb459e71960
>>>         $30 = {q = {next = 0x7fb43ddea8a0, prev = 0x7d68a0 <LRU+224>},
>>>         qid = LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 2}
>>>         (gdb) p *(mdcache_lru_t *)0x7fb43ddea8a0
>>>         $31 = {q = {next = 0x7fb3f041f9a0, prev = 0x7fb459e71960}, qid =
>>>         LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
>>>         (gdb) p *(mdcache_lru_t *)0x7fb3f041f9a0
>>>         $32 = {q = {next = 0x7fb466960200, prev = 0x7fb43ddea8a0}, qid =
>>>         LRU_ENTRY_L1, refcnt = 1, flags = 0, lane = 1, cf = 0}
>>>         (gdb) p *(mdcache_lru_t *)0x7fb466960200
>>>         $33 = {q = {next = 0x7fb451e20570, prev = 0x7fb3f041f9a0}, qid =
>>>         LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}
>>>
>>>         The entries with refcnt 1 are moved to L2 by the background
>>>         thread (lru_run). However, it does this only if the open file
>>>         count is greater than the low water mark. In my case, the
>>>         open_fd_count is not high, so lru_run() doesn't call
>>>         lru_run_lane() to demote those entries to L2. What is the best
>>>         approach to handle this scenario?
>>>
>>>         Thanks,
>>>         Pradeep
>>>
>>>
>>>
>>>         On Mon, Aug 7, 2017 at 6:08 AM, Daniel Gryniewicz
>>>         <d...@redhat.com> wrote:
>>>
>>>              It never has been.  In cache_inode, a pin-ref kept it from
>>>              being reaped; now any ref beyond 1 keeps it.
>>>
>>>              On Fri, Aug 4, 2017 at 1:31 PM, Frank Filz
>>>              <ffilz...@mindspring.com> wrote:
>>>               >> I'm hitting a case where mdcache keeps growing well
>>>               >> beyond the high water mark. Here is a snapshot of the
>>>               >> lru_state:
>>>               >>
>>>               >> 1 = {entries_hiwat = 100000, entries_used = 2306063,
>>>               >> chunks_hiwat = 100000, chunks_used = 16462,
>>>               >>
>>>               >> It has grown to 2.3 million entries and each entry is
>>>               >> ~1.6K.
>>>               >>
>>>               >> I looked at the first entry in lane 0, L1 queue:
>>>               >>
>>>               >> (gdb) p LRU[0].L1
>>>               >> $9 = {q = {next = 0x7fad64256f00, prev = 0x7faf21a1bc00},
>>>               >> id = LRU_ENTRY_L1, size = 254628}
>>>               >> (gdb) p (mdcache_entry_t *)(0x7fad64256f00-1024)
>>>               >> $10 = (mdcache_entry_t *) 0x7fad64256b00
>>>               >> (gdb) p $10->lru
>>>               >> $11 = {q = {next = 0x7fad65ea0f00, prev = 0x7d67c0 <LRU>},
>>>               >> qid = LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 0, cf = 0}
>>>               >> (gdb) p $10->fh_hk.inavl
>>>               >> $13 = true
>>>               >
>>>               > The refcount 2 prevents reaping.
>>>               >
>>>               > There could be a refcount leak.
>>>               >
>>>               > Hmm, though, I thought the entries_hwmark was a hard
>>>         limit, guess
>>>              not...
>>>               >
>>>               > Frank
>>>               >
>>>               >> Lane 1:
>>>               >> (gdb) p LRU[1].L1
>>>               >> $18 = {q = {next = 0x7fad625c0300, prev = 0x7faec08c5100},
>>>               >> id = LRU_ENTRY_L1, size = 253006}
>>>               >> (gdb) p (mdcache_entry_t *)(0x7fad625c0300 - 1024)
>>>               >> $21 = (mdcache_entry_t *) 0x7fad625bff00
>>>               >> (gdb) p $21->lru
>>>               >> $22 = {q = {next = 0x7fad66fce600, prev = 0x7d68a0 <LRU+224>},
>>>               >> qid = LRU_ENTRY_L1, refcnt = 2, flags = 0, lane = 1, cf = 1}
>>>               >>
>>>               >> (gdb) p $21->fh_hk.inavl
>>>               >> $24 = true
>>>               >>
>>>               >> As per LRU_ENTRY_RECLAIMABLE(), these entries should be
>>>               >> reclaimable. Not sure why it is not able to reclaim them.
>>>               >> Any ideas?
>>>               >>
>>>               >> Thanks,
>>>               >> Pradeep
>>>               >>
>>>               >>
>>>               >
>>>               >
>>>               >
>>>
>>>
>>>
>>>
>>>
>>
>
