On Tue, Apr 30 2019, Patrick Farrell wrote:

> Neil,
>
> My understanding is marking the inode cache reclaimable would make Lustre
> unusual/unique among Linux file systems.  Is that incorrect?
I think Lustre is already somewhat unusual and unique :-)
I understand your desire to follow patterns set by more established
filesystems, and that is probably a good baseline.  But when the behaviour
of the other filesystems is demonstrably wrong, there is little to be
gained from following it.

That said: 9p, adfs, affs, befs, bfs, btrfs, ceph, cifs, coda, efs, ext2,
ext4, f2fs, fat, freevxfs, fuse, gfs2, hpfs, isofs, jffs2, jfs, minix,
nfs, nilfs, ntfs, ocfs2, openpromfs, overlayfs, procfs, qnx4, qnx6,
reiserfs, romfs, squashfs, sysvfs, ubifs, udf, ufs, xfs all set
SLAB_RECLAIM_ACCOUNT on their inode caches.

So to answer your question: your understanding *is* incorrect.

(For anyone who wants to poke at this on a running client, a few commands
are sketched at the very bottom of this mail, below the quoted thread.)

NeilBrown

>
> - Patrick
> ________________________________
> From: lustre-discuss <[email protected]> on behalf of
> NeilBrown <[email protected]>
> Sent: Monday, April 29, 2019 8:53:43 PM
> To: Jacek Tomaka
> Cc: [email protected]
> Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable
>
> On Mon, Apr 29 2019, Jacek Tomaka wrote:
>
>>> so lustre_inode_cache is the real culprit when signal_cache appears to
>>> be large.
>>> This cache is slaved on the common inode cache, so there should be one
>>> entry for each lustre inode that is in memory.
>>> These inodes should get pruned when they've been inactive for a while.
>>
>> What triggers the pruning?
>>
>
> Memory pressure.
> The approximate approach is to try to free some unused pages and about
> 1/2000th of the entries in each slab.  Then, if that hasn't made enough
> space available, try again.
>
>>> If you look in /proc/sys/fs/inode-nr there should be two numbers:
>>> The first is the total number of in-memory inodes for all filesystems.
>>> The second is the number of "unused" inodes.
>>>
>>> When you write "3" to drop_caches, the second number should drop down to
>>> nearly zero (I get 95 on my desktop, down from 6524).
>>
>> Ok, that is useful to know, but echoing 3 to drop_caches or generating
>> memory pressure clears most of the signal_cache (inode) as well as other
>> lustre objects, so this is working fine.
>
> Oh good, I hadn't remembered clearly what the issue was.
>
>>
>> The issue that remains is that they are marked as SUnreclaim vs
>> SReclaimable.
>
> Yes, I think lustre_inode_cache should certainly be flagged as
> SLAB_RECLAIM_ACCOUNT.
> If the SReclaimable value is too small (and there aren't many
> reclaimable pagecache pages), vmscan can decide not to bother.  This is
> probably a fairly small risk, but it is possible that the missing
> SLAB_RECLAIM_ACCOUNT flag can result in memory not being reclaimed when
> it could be.
>
> Thanks,
> NeilBrown
>
>
>> So I do not think there is a memory leak per se.
>>
>> Regards.
>> Jacek Tomaka
>>
>> On Mon, Apr 29, 2019 at 1:39 PM NeilBrown <[email protected]> wrote:
>>
>>>
>>> Thanks Jacek,
>>> so lustre_inode_cache is the real culprit when signal_cache appears to
>>> be large.
>>> This cache is slaved on the common inode cache, so there should be one
>>> entry for each lustre inode that is in memory.
>>> These inodes should get pruned when they've been inactive for a while.
>>>
>>> If you look in /proc/sys/fs/inode-nr there should be two numbers:
>>> The first is the total number of in-memory inodes for all filesystems.
>>> The second is the number of "unused" inodes.
>>>
>>> When you write "3" to drop_caches, the second number should drop down to
>>> nearly zero (I get 95 on my desktop, down from 6524).
>>>
>>> When signal_cache stays large even after the drop_caches, it suggests
>>> that there are lots of lustre inodes that are thought to be still
>>> active.  I'd have to do a bit of digging to understand what that means,
>>> and a lot more to work out why lustre is holding on to inodes longer
>>> than you would expect (if that actually is the case).
>>>
>>> If an inode still has cached data pages attached that cannot easily be
>>> removed, it will not be purged even if it is unused.
>>> So if you see the "unused" number remaining high even after a
>>> "drop_caches", that might mean that lustre isn't letting go of cache
>>> pages for some reason.
>>>
>>> NeilBrown
>>>
>>>
>>> On Mon, Apr 29 2019, Jacek Tomaka wrote:
>>>
>>> > Wow, Thanks Nathan and NeilBrown.
>>> > It is great to learn about slub merging.  It is awesome to have a
>>> > reproducer.
>>> > I am yet to trigger my original problem with slub_nomerge, but the
>>> > slabinfo tool (in the kernel sources) can actually show merged caches:
>>> > kernel/3.10.0-693.5.2.el7/tools/slabinfo -a
>>> >
>>> > :t-0000112 <- sysfs_dir_cache kernfs_node_cache blkdev_integrity task_delay_info
>>> > :t-0000144 <- flow_cache cl_env_kmem
>>> > :t-0000160 <- sigqueue lov_object_kmem
>>> > :t-0000168 <- lovsub_object_kmem osc_extent_kmem
>>> > :t-0000176 <- vvp_object_kmem nfsd4_stateids
>>> > :t-0000192 <- ldlm_resources kiocb cred_jar inet_peer_cache key_jar file_lock_cache kmalloc-192 dmaengine-unmap-16 bio_integrity_payload
>>> > :t-0000216 <- vvp_session_kmem vm_area_struct
>>> > :t-0000256 <- biovec-16 ip_dst_cache bio-0 ll_file_data kmalloc-256 sgpool-8 filp request_sock_TCP rpc_tasks request_sock_TCPv6 skbuff_head_cache pool_workqueue lov_thread_kmem
>>> > :t-0000264 <- osc_lock_kmem numa_policy
>>> > :t-0000328 <- osc_session_kmem taskstats
>>> > :t-0000576 <- kioctx xfrm_dst_cache vvp_thread_kmem
>>> > :t-0001152 <- signal_cache lustre_inode_cache
>>> >
>>> > It is not on a machine that had the problem I described before, but the
>>> > kernel version is the same, so I am assuming the cache merges are the same.
>>> >
>>> > Looks like signal_cache points to lustre_inode_cache.
>>> > Regards.
>>> > Jacek Tomaka
>>> >
>>> >
>>> > On Thu, Apr 25, 2019 at 7:42 AM NeilBrown <[email protected]> wrote:
>>> >
>>> >>
>>> >> Hi,
>>> >> you seem to be able to reproduce this fairly easily.
>>> >> If so, could you please boot with the "slub_nomerge" kernel parameter
>>> >> and then reproduce the (apparent) memory leak.
>>> >> I'm hoping that this will show some other slab that is actually using
>>> >> the memory - a slab with a very similar object size to signal_cache
>>> >> that is, by default, being merged with signal_cache.
>>> >>
>>> >> Thanks,
>>> >> NeilBrown
>>> >>
>>> >>
>>> >> On Wed, Apr 24 2019, Nathan Dauchy - NOAA Affiliate wrote:
>>> >>
>>> >> > On Mon, Apr 15, 2019 at 9:18 PM Jacek Tomaka <[email protected]> wrote:
>>> >> >
>>> >> >>
>>> >> >> > signal_cache should have one entry for each process (or thread-group).
>>> >> >>
>>> >> >> That is what I thought as well; looking at the kernel source,
>>> >> >> allocations from signal_cache happen only during fork.
>>> >> >>
>>> >> >>
>>> >> > I was recently chasing an issue with clients suffering from low memory
>>> >> > and saw that "signal_cache" was a major player.  But the workload on
>>> >> > those clients was not doing a lot of forking (and I don't *think*
>>> >> > threading either).  Rather it was a LOT of metadata read operations.
>>> >> >
>>> >> > You can see the symptoms by a simple "du" on a Lustre file system:
>>> >> >
>>> >> > # grep signal_cache /proc/slabinfo
>>> >> > signal_cache      967   1092   1152   28    8 : tunables 0 0 0 : slabdata     39     39      0
>>> >> >
>>> >> > # du -s /mnt/lfs1/projects/foo
>>> >> > 339744908       /mnt/lfs1/projects/foo
>>> >> >
>>> >> > # grep signal_cache /proc/slabinfo
>>> >> > signal_cache   164724 164724   1152   28    8 : tunables 0 0 0 : slabdata   5883   5883      0
>>> >> >
>>> >> > # slabtop -s c -o | head -n 20
>>> >> >  Active / Total Objects (% used)    : 3660791 / 3662863 (99.9%)
>>> >> >  Active / Total Slabs (% used)      : 93019 / 93019 (100.0%)
>>> >> >  Active / Total Caches (% used)     : 72 / 107 (67.3%)
>>> >> >  Active / Total Size (% used)       : 836474.91K / 837502.16K (99.9%)
>>> >> >  Minimum / Average / Maximum Object : 0.01K / 0.23K / 12.75K
>>> >> >
>>> >> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> >> > 164724 164724 100%    1.12K   5883       28    188256K signal_cache
>>> >> > 331712 331712 100%    0.50K  10366       32    165856K ldlm_locks
>>> >> > 656896 656896 100%    0.12K  20528       32     82112K kmalloc-128
>>> >> > 340200 339971  99%    0.19K   8100       42     64800K kmalloc-192
>>> >> > 162838 162838 100%    0.30K   6263       26     50104K osc_object_kmem
>>> >> > 744192 744192 100%    0.06K  11628       64     46512K kmalloc-64
>>> >> > 205128 205128 100%    0.19K   4884       42     39072K dentry
>>> >> >   4268   4256  99%    8.00K   1067        4     34144K kmalloc-8192
>>> >> > 162978 162978 100%    0.17K   3543       46     28344K vvp_object_kmem
>>> >> > 162792 162792 100%    0.16K   6783       24     27132K kvm_mmu_page_header
>>> >> > 162825 162825 100%    0.16K   6513       25     26052K sigqueue
>>> >> >  16368  16368 100%    1.02K    528       31     16896K nfs_inode_cache
>>> >> >  20385  20385 100%    0.58K    755       27     12080K inode_cache
>>> >> >
>>> >> > Repeat that for more (and bigger) directories and the slab cache added
>>> >> > up to more than half the memory on this 24GB node.
>>> >> >
>>> >> > This is with CentOS-7.6 and lustre-2.10.5_ddn6.
>>> >> >
>>> >> > I worked around the problem by tackling the "ldlm_locks" memory usage with:
>>> >> > # lctl set_param ldlm.namespaces.lfs*.lru_max_age=10000
>>> >> >
>>> >> > ...but I did not find a way to reduce the "signal_cache".
>>> >> >
>>> >> > Regards,
>>> >> > Nathan
>>> >>
>>> >
>>> >
>>> > --
>>> > Jacek Tomaka
>>> > Geophysical Software Developer
>>> >
>>> > DownUnder GeoSolutions
>>> > 76 Kings Park Road
>>> > West Perth 6005 WA, Australia
>>> > tel +61 8 9287 4143
>>> > [email protected]
>>> > www.dug.com
>>
>>
>> --
>> Jacek Tomaka
>> Geophysical Software Developer
>>
>> DownUnder GeoSolutions
>> 76 Kings Park Road
>> West Perth 6005 WA, Australia
>> tel +61 8 9287 4143
>> [email protected]
>> www.dug.com
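
For anyone who wants to poke at this on a running client, here is a rough
sketch.  It assumes a SLUB kernel with /sys/kernel/slab available; the
"lustre_inode_cache" name may only exist as an alias of a merged cache
(signal_cache on the systems discussed above), and I haven't tested this
on a Lustre client myself:

  # grep -E 'SReclaimable|SUnreclaim' /proc/meminfo
  # cat /proc/sys/fs/inode-nr      # total in-memory inodes, then "unused" inodes
  # echo 3 > /proc/sys/vm/drop_caches
  # cat /proc/sys/fs/inode-nr      # the "unused" count should now be close to zero
  # cat /sys/kernel/slab/lustre_inode_cache/reclaim_account   # 1 = SLAB_RECLAIM_ACCOUNT set

If reclaim_account reports 0, that cache's pages are being counted in
SUnreclaim rather than SReclaimable, which matches what was reported
earlier in the thread.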
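
Relatedly, if the slabinfo tool isn't handy, the same aliasing information
should be visible directly in /sys/kernel/slab: with SLUB, merged caches
appear (I believe) as symlinks to a shared ":t-<size>" entry, so something
like the following should show which caches are sharing slabs with
signal_cache (the ":t-0001152" name is taken from Jacek's output above and
will differ between systems):

  # ls -l /sys/kernel/slab/signal_cache /sys/kernel/slab/lustre_inode_cache
  # ls -l /sys/kernel/slab | grep ':t-0001152$'    # every alias of that merged cache

Booting with slub_nomerge, as suggested earlier, avoids the merging
entirely and makes /proc/slabinfo attribute the memory to the right cache.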
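
Finally, as a one-off experiment (rather than the permanent lru_max_age
change Nathan used), it might be worth clearing the client's lock LRU and
then dropping caches, to see whether the inodes become freeable once the
locks are gone.  Untested sketch, using Nathan's "lfs*" namespace pattern:

  # lctl set_param ldlm.namespaces.lfs*.lru_size=clear
  # echo 3 > /proc/sys/vm/drop_caches
  # grep -E 'signal_cache|ldlm_locks' /proc/slabinfo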
