On Tue, Apr 30 2019, Patrick Farrell wrote:

> Neil,
>
> My understanding is marking the inode cache reclaimable would make Lustre
> unusual/unique among Linux file systems.  Is that incorrect?
I think Lustre is already somewhat unusual and unique :-)
I understand your desire to follow patterns set by more established
filesystems, and that is probably a good baseline.  But when the behaviour
of the other filesystems is demonstrably wrong, there is little to be
gained from following it.

That said: 9p, adfs, affs, befs, bfs, btrfs, ceph, cifs, coda, efs, ext2,
ext4, f2fs, fat, freevxfs, fuse, gfs2, hpfs, isofs, jffs2, jfs, minix,
nfs, nilfs, ntfs, ocfs2, openpromfs, overlayfs, procfs, qnx4, qnx6,
reiserfs, romfs, squashfs, sysvfs, ubifs, udf, ufs, xfs all set
SLAB_RECLAIM_ACCOUNT on their inode caches.

So to answer your question: your understanding *is* incorrect.

(For anyone who wants to poke at this on a running client, a few commands
are sketched at the very bottom of this mail, below the quoted thread.)

NeilBrown

>
> - Patrick
> ________________________________
> From: lustre-discuss <[email protected]> on behalf of
> NeilBrown <[email protected]>
> Sent: Monday, April 29, 2019 8:53:43 PM
> To: Jacek Tomaka
> Cc: [email protected]
> Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable
>
> On Mon, Apr 29 2019, Jacek Tomaka wrote:
>
>>> so lustre_inode_cache is the real culprit when signal_cache appears to
>>> be large.
>>> This cache is slaved on the common inode cache, so there should be one
>>> entry for each lustre inode that is in memory.
>>> These inodes should get pruned when they've been inactive for a while.
>>
>> What triggers the pruning?
>>
>
> Memory pressure.
> The approximate approach is to try to free some unused pages and about
> 1/2000th of the entries in each slab.  Then, if that hasn't made enough
> space available, try again.
>
>>> If you look in /proc/sys/fs/inode-nr there should be two numbers:
>>> The first is the total number of in-memory inodes for all filesystems.
>>> The second is the number of "unused" inodes.
>>>
>>> When you write "3" to drop_caches, the second number should drop down to
>>> nearly zero (I get 95 on my desktop, down from 6524).
>>
>> Ok, that is useful to know, but echoing 3 to drop_caches or generating
>> memory pressure clears most of the signal_cache (inode) as well as other
>> lustre objects, so this is working fine.
>
> Oh good, I hadn't remembered clearly what the issue was.
>
>>
>> The issue that remains is that they are marked as SUnreclaim vs
>> SReclaimable.
>
> Yes, I think lustre_inode_cache should certainly be flagged as
> SLAB_RECLAIM_ACCOUNT.
> If the SReclaimable value is too small (and there aren't many
> reclaimable pagecache pages), vmscan can decide not to bother.  This is
> probably a fairly small risk, but it is possible that the missing
> SLAB_RECLAIM_ACCOUNT flag can result in memory not being reclaimed when
> it could be.
>
> Thanks,
> NeilBrown
>
>
>> So I do not think there is a memory leak per se.
>>
>> Regards.
>> Jacek Tomaka
>>
>> On Mon, Apr 29, 2019 at 1:39 PM NeilBrown <[email protected]> wrote:
>>
>>>
>>> Thanks Jacek,
>>> so lustre_inode_cache is the real culprit when signal_cache appears to
>>> be large.
>>> This cache is slaved on the common inode cache, so there should be one
>>> entry for each lustre inode that is in memory.
>>> These inodes should get pruned when they've been inactive for a while.
>>>
>>> If you look in /proc/sys/fs/inode-nr there should be two numbers:
>>> The first is the total number of in-memory inodes for all filesystems.
>>> The second is the number of "unused" inodes.
>>>
>>> When you write "3" to drop_caches, the second number should drop down to
>>> nearly zero (I get 95 on my desktop, down from 6524).
>>>
>>> When signal_cache stays large even after the drop_caches, it suggests
>>> that there are lots of lustre inodes that are thought to be still
>>> active.  I'd have to do a bit of digging to understand what that means,
>>> and a lot more to work out why lustre is holding on to inodes longer
>>> than you would expect (if that actually is the case).
>>>
>>> If an inode still has cached data pages attached that cannot easily be
>>> removed, it will not be purged even if it is unused.
>>> So if you see the "unused" number remaining high even after a
>>> "drop_caches", that might mean that lustre isn't letting go of cache
>>> pages for some reason.
>>>
>>> NeilBrown
>>>
>>>
>>> On Mon, Apr 29 2019, Jacek Tomaka wrote:
>>>
>>> > Wow, Thanks Nathan and NeilBrown.
>>> > It is great to learn about slub merging.  It is awesome to have a
>>> > reproducer.
>>> > I am yet to trigger my original problem with slub_nomerge, but the
>>> > slabinfo tool (in the kernel sources) can actually show merged caches:
>>> > kernel/3.10.0-693.5.2.el7/tools/slabinfo -a
>>> >
>>> > :t-0000112 <- sysfs_dir_cache kernfs_node_cache blkdev_integrity task_delay_info
>>> > :t-0000144 <- flow_cache cl_env_kmem
>>> > :t-0000160 <- sigqueue lov_object_kmem
>>> > :t-0000168 <- lovsub_object_kmem osc_extent_kmem
>>> > :t-0000176 <- vvp_object_kmem nfsd4_stateids
>>> > :t-0000192 <- ldlm_resources kiocb cred_jar inet_peer_cache key_jar file_lock_cache kmalloc-192 dmaengine-unmap-16 bio_integrity_payload
>>> > :t-0000216 <- vvp_session_kmem vm_area_struct
>>> > :t-0000256 <- biovec-16 ip_dst_cache bio-0 ll_file_data kmalloc-256 sgpool-8 filp request_sock_TCP rpc_tasks request_sock_TCPv6 skbuff_head_cache pool_workqueue lov_thread_kmem
>>> > :t-0000264 <- osc_lock_kmem numa_policy
>>> > :t-0000328 <- osc_session_kmem taskstats
>>> > :t-0000576 <- kioctx xfrm_dst_cache vvp_thread_kmem
>>> > :t-0001152 <- signal_cache lustre_inode_cache
>>> >
>>> > It is not on a machine that had the problem I described before, but the
>>> > kernel version is the same, so I am assuming the cache merges are the same.
>>> >
>>> > Looks like signal_cache points to lustre_inode_cache.
>>> > Regards.
>>> > Jacek Tomaka
>>> >
>>> >
>>> > On Thu, Apr 25, 2019 at 7:42 AM NeilBrown <[email protected]> wrote:
>>> >
>>> >>
>>> >> Hi,
>>> >> you seem to be able to reproduce this fairly easily.
>>> >> If so, could you please boot with the "slub_nomerge" kernel parameter
>>> >> and then reproduce the (apparent) memory leak.
>>> >> I'm hoping that this will show some other slab that is actually using
>>> >> the memory - a slab with a very similar object size to signal_cache
>>> >> that is, by default, being merged with signal_cache.
>>> >>
>>> >> Thanks,
>>> >> NeilBrown
>>> >>
>>> >>
>>> >> On Wed, Apr 24 2019, Nathan Dauchy - NOAA Affiliate wrote:
>>> >>
>>> >> > On Mon, Apr 15, 2019 at 9:18 PM Jacek Tomaka <[email protected]> wrote:
>>> >> >
>>> >> >>
>>> >> >> > signal_cache should have one entry for each process (or thread-group).
>>> >> >>
>>> >> >> That is what I thought as well; looking at the kernel source,
>>> >> >> allocations from signal_cache happen only during fork.
>>> >> >>
>>> >> >>
>>> >> > I was recently chasing an issue with clients suffering from low memory
>>> >> > and saw that "signal_cache" was a major player.  But the workload on
>>> >> > those clients was not doing a lot of forking (and I don't *think*
>>> >> > threading either).  Rather it was a LOT of metadata read operations.
>>> >> >
>>> >> > You can see the symptoms by a simple "du" on a Lustre file system:
>>> >> >
>>> >> > # grep signal_cache /proc/slabinfo
>>> >> > signal_cache      967   1092   1152   28    8 : tunables 0 0 0 : slabdata     39     39      0
>>> >> >
>>> >> > # du -s /mnt/lfs1/projects/foo
>>> >> > 339744908       /mnt/lfs1/projects/foo
>>> >> >
>>> >> > # grep signal_cache /proc/slabinfo
>>> >> > signal_cache   164724 164724   1152   28    8 : tunables 0 0 0 : slabdata   5883   5883      0
>>> >> >
>>> >> > # slabtop -s c -o | head -n 20
>>> >> >  Active / Total Objects (% used)    : 3660791 / 3662863 (99.9%)
>>> >> >  Active / Total Slabs (% used)      : 93019 / 93019 (100.0%)
>>> >> >  Active / Total Caches (% used)     : 72 / 107 (67.3%)
>>> >> >  Active / Total Size (% used)       : 836474.91K / 837502.16K (99.9%)
>>> >> >  Minimum / Average / Maximum Object : 0.01K / 0.23K / 12.75K
>>> >> >
>>> >> >   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> >> > 164724 164724 100%    1.12K   5883       28    188256K signal_cache
>>> >> > 331712 331712 100%    0.50K  10366       32    165856K ldlm_locks
>>> >> > 656896 656896 100%    0.12K  20528       32     82112K kmalloc-128
>>> >> > 340200 339971  99%    0.19K   8100       42     64800K kmalloc-192
>>> >> > 162838 162838 100%    0.30K   6263       26     50104K osc_object_kmem
>>> >> > 744192 744192 100%    0.06K  11628       64     46512K kmalloc-64
>>> >> > 205128 205128 100%    0.19K   4884       42     39072K dentry
>>> >> >   4268   4256  99%    8.00K   1067        4     34144K kmalloc-8192
>>> >> > 162978 162978 100%    0.17K   3543       46     28344K vvp_object_kmem
>>> >> > 162792 162792 100%    0.16K   6783       24     27132K kvm_mmu_page_header
>>> >> > 162825 162825 100%    0.16K   6513       25     26052K sigqueue
>>> >> >  16368  16368 100%    1.02K    528       31     16896K nfs_inode_cache
>>> >> >  20385  20385 100%    0.58K    755       27     12080K inode_cache
>>> >> >
>>> >> > Repeat that for more (and bigger) directories and the slab cache added
>>> >> > up to more than half the memory on this 24GB node.
>>> >> >
>>> >> > This is with CentOS-7.6 and lustre-2.10.5_ddn6.
>>> >> >
>>> >> > I worked around the problem by tackling the "ldlm_locks" memory usage with:
>>> >> > # lctl set_param ldlm.namespaces.lfs*.lru_max_age=10000
>>> >> >
>>> >> > ...but I did not find a way to reduce the "signal_cache".
>>> >> >
>>> >> > Regards,
>>> >> > Nathan
>>> >>
>>> >
>>> >
>>> > --
>>> > Jacek Tomaka
>>> > Geophysical Software Developer
>>> >
>>> > DownUnder GeoSolutions
>>> > 76 Kings Park Road
>>> > West Perth 6005 WA, Australia
>>> > tel +61 8 9287 4143
>>> > [email protected]
>>> > www.dug.com
>>
>>
>> --
>> Jacek Tomaka
>> Geophysical Software Developer
>>
>> DownUnder GeoSolutions
>> 76 Kings Park Road
>> West Perth 6005 WA, Australia
>> tel +61 8 9287 4143
>> [email protected]
>> www.dug.com
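
For anyone who wants to poke at this on a running client, here is a rough
sketch.  It assumes a SLUB kernel with /sys/kernel/slab available; the
"lustre_inode_cache" name may only exist as an alias of a merged cache
(signal_cache on the systems discussed above), and I haven't tested this
on a Lustre client myself:

  # grep -E 'SReclaimable|SUnreclaim' /proc/meminfo
  # cat /proc/sys/fs/inode-nr      # total in-memory inodes, then "unused" inodes
  # echo 3 > /proc/sys/vm/drop_caches
  # cat /proc/sys/fs/inode-nr      # the "unused" count should now be close to zero
  # cat /sys/kernel/slab/lustre_inode_cache/reclaim_account   # 1 = SLAB_RECLAIM_ACCOUNT set

If reclaim_account reports 0, that cache's pages are being counted in
SUnreclaim rather than SReclaimable, which matches what was reported
earlier in the thread.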
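
Relatedly, if the slabinfo tool isn't handy, the same aliasing information
should be visible directly in /sys/kernel/slab: with SLUB, merged caches
appear (I believe) as symlinks to a shared ":t-<size>" entry, so something
like the following should show which caches are sharing slabs with
signal_cache (the ":t-0001152" name is taken from Jacek's output above and
will differ between systems):

  # ls -l /sys/kernel/slab/signal_cache /sys/kernel/slab/lustre_inode_cache
  # ls -l /sys/kernel/slab | grep ':t-0001152$'    # every alias of that merged cache

Booting with slub_nomerge, as suggested earlier, avoids the merging
entirely and makes /proc/slabinfo attribute the memory to the right cache.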
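
Finally, as a one-off experiment (rather than the permanent lru_max_age
change Nathan used), it might be worth clearing the client's lock LRU and
then dropping caches, to see whether the inodes become freeable once the
locks are gone.  Untested sketch, using Nathan's "lfs*" namespace pattern:

  # lctl set_param ldlm.namespaces.lfs*.lru_size=clear
  # echo 3 > /proc/sys/vm/drop_caches
  # grep -E 'signal_cache|ldlm_locks' /proc/slabinfo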
