I ran this in one bash:

for i in {1..100}; do cat /proc/36960/stack >$i; sleep 1; done

and in the other one (PID 36960):

time -p echo 3 >/proc/sys/vm/drop_caches
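In hindsight a one-second interval is fairly coarse; next time I would sample more densely, something like the sketch below (untested; the 0.2 s interval, sample count and output directory are arbitrary choices of mine):

mkdir -p /tmp/stacks
for i in {1..600}; do
    # timestamped sample names make it easy to line up with the drop_caches timing
    cat /proc/36960/stack > /tmp/stacks/$(date +%s.%N)
    sleep 0.2
done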
It took about two minutes. Unfortunately, most of the samples claim it was not doing anything kernel side:

[<ffffffffffffffff>] 0xffffffffffffffff

with the exception of two files, at 32 sec and 73 sec:

[root@xxx xxx]# cat 32
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffffffffff>] 0xffffffffffffffff
[root@xxx xxx]# cat 73
[<ffffffffc11231db>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffc11330f6>] ll_delete_inode+0xa6/0x1c0 [lustre]
[<ffffffff8121d729>] evict+0xa9/0x180
[<ffffffff8121d83e>] dispose_list+0x3e/0x50
[<ffffffff8121e834>] prune_icache_sb+0x174/0x340
[<ffffffff81203863>] prune_super+0x143/0x170
[<ffffffff81195443>] shrink_slab+0x163/0x330
[<ffffffff812655f3>] drop_caches_sysctl_handler+0xc3/0x120
[<ffffffff8127c203>] proc_sys_call_handler+0xd3/0xf0
[<ffffffff8127c234>] proc_sys_write+0x14/0x20
[<ffffffff81200cad>] vfs_write+0xbd/0x1e0
[<ffffffff81201abf>] SyS_write+0x7f/0xe0
[<ffffffff816b5292>] tracesys+0xdd/0xe2
[<ffffffffffffffff>] 0xffffffffffffffff

Also, after unmounting the lustre fs and removing every module I could relate to lustre, I could still see some vvp_object_kmem. Is that expected?

[root@xxx xxx]# rmmod obdclass ptlrpc ksocklnd libcfs lnet lustre fid mdc osc cnetmgc fld lmv lov
rmmod: ERROR: Module obdclass is not currently loaded
rmmod: ERROR: Module ptlrpc is not currently loaded
rmmod: ERROR: Module ksocklnd is not currently loaded
rmmod: ERROR: Module libcfs is not currently loaded
rmmod: ERROR: Module lnet is not currently loaded
rmmod: ERROR: Module lustre is not currently loaded
rmmod: ERROR: Module fid is not currently loaded
rmmod: ERROR: Module mdc is not currently loaded
rmmod: ERROR: Module osc is not currently loaded
rmmod: ERROR: Module cnetmgc is not currently loaded
rmmod: ERROR: Module fld is not currently loaded
rmmod: ERROR: Module lmv is not currently loaded
rmmod: ERROR: Module lov is not currently loaded
[root@xxx xxx]# cat /proc/slabinfo |grep vvp
vvp_object_kmem 32982 33212 176 46 2 : tunables 0 0 0 : slabdata 722 722 0
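One theory I want to rule out (an assumption on my part, and it only applies if this kernel uses SLUB): caches with identical object size and flags can be merged, so the vvp_object_kmem line may just be an alias of a shared cache that other users keep populated after the lustre modules are gone. If so, it should show up as a symlink under /sys/kernel/slab, and its "aliases" attribute reports how many caches were merged into it:

[root@xxx xxx]# ls -l /sys/kernel/slab/ | grep vvp
[root@xxx xxx]# cat /sys/kernel/slab/vvp_object_kmem/aliases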
Regards.
Jacek Tomaka

On Tue, Apr 16, 2019 at 11:18 AM Jacek Tomaka <jac...@dug.com> wrote:
>
> > That would be interesting. About a dozen copies of
> >   cat /proc/$PID/stack
> > taken in quick succession would be best, where $PID is the pid of
> > the shell process which wrote to drop_caches.
>
> Will do later today. I have found a candidate node with the problem, just
> need to wait for the current task to finish.
>
> > signal_cache should have one entry for each process (or thread-group).
>
> That is what I thought as well; looking at the kernel source, allocations
> from signal_cache happen only during fork.
>
> > It holds the signal_struct structure that is shared among the threads
> > in a group.
> > So 3.7 million signal_structs suggests there are 3.7 million processes
> > on the system. I don't think Linux supports more than 4 million, so
> > that is one very busy system.
>
> Not as much.
> Top shows:
> Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie
> slabinfo (note that this is a different node than in my original email):
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
> nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
> nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
> fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
> iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
> t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
> t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
> se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
> kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
> nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
> rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
> vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
> ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
> lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
> osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
> osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
> osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
> ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
> ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
> ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
> ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
> ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
> ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
> ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
> ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
> ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
> ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
> jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
> RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
> UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
> tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
> TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
> cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
> cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
> bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
> hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
> dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
> userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
> fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
> pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
> posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
> UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
> flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
> xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
> UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
> tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
> TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
> inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
> scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
> blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
> user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
> dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
> sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
> net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
> Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
> Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0
>
> > Unless... the final "put" of a task_struct happens via call_rcu - so it
> > can be delayed a while, normally 10s of milliseconds, but it can take
> > seconds to clear a large backlog.
> > So if you have lots of processes being created and destroyed very
> > quickly, then you might get a backlog of task_struct, and the associated
> > signal_struct, waiting to be destroyed.
>
> The node from my original mail had been idle for days before I ran the
> test described.
>
> > However, if the task_struct slab were particularly big, I suspect you
> > would have included it in the list of large slabs - but you didn't.
> > If signal_cache has more active entries than task_struct, then something
> > has gone seriously wrong somewhere.
>
> Indeed, this is the case. The number of tasks and task_structs is way
> smaller than the number of signal_cache structs.
>
> > I doubt this problem is related to lustre.
>
> Hmm. Interesting. It looks like __put_task_struct calls into
> put_signal_struct, which will not free a signal_struct that is still
> referenced by something.
>
> I wonder if this could be related to the log entries we see:
> _slurm_cgroup_destroy: problem deleting step cgroup path
> /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy
> And we are running with nohz_full, so it is going to be an interesting
> problem to diagnose...
>
> But this seems to be going off on a tangent. Still, thank you for the
> useful hints and analysis.
>
> Jacek Tomaka
>
> On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <ne...@suse.com> wrote:
>
>> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>>
>> > Thanks Patrick for getting the ball rolling!
>> >
>> >> 1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
>> >> causes all registered shrinkers to be run, until they report there is
>> >> nothing left that can be discarded. If this is taking 10 minutes,
>> >> then it seems likely that some shrinker is either very inefficient, or
>> >> is reporting that there is more work to be done, when really there
>> >> isn't.
>> >
>> > This is a pretty common problem on this hardware. KNL's CPU is running
>> > at ~1.3GHz, so anything that is not multi-threaded can take a few times
>> > longer than on a "normal" Xeon. While it would be nice to improve this
>> > (by running it in multiple threads), this is not the problem here.
>> > However, I can provide you with a kernel call stack the next time I see
>> > it, if you are interested.
>>
>> That would be interesting. About a dozen copies of
>>   cat /proc/$PID/stack
>> taken in quick succession would be best, where $PID is the pid of
>> the shell process which wrote to drop_caches.
>>
>> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>> >> reclaims anything that can be reclaimed immediately.
>> >
>> > Awesome. I would just like to know how much easily available memory
>> > there is on the system without actually reclaiming it and seeing;
>> > ideally using normal kernel mechanisms, but if lustre provides a procfs
>> > entry where I can get it, it will solve my immediate problem.
>> >
>> >> 4/ Patrick is right that accounting is best-effort. But we do want it
>> >> to improve.
>> >
>> > Accounting looks better when Lustre is not involved ;) Seriously, how
>> > can I help? Should I raise a bug? Try to provide a patch?
>> >
>> >> Just last week there was a report
>> >>   https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>> >> about making slab-allocated objects movable. If/when that gets off
>> >> the ground, it should help the fragmentation problem, so more of the
>> >> pages listed as reclaimable should actually be so.
>> >
>> > This is a very interesting article. While memory fragmentation makes it
>> > more difficult to use huge pages, it is not directly related to the
>> > problem of lustre kernel memory allocation accounting. It will be good
>> > to see movable slabs, though.
>> >
>> > Also, I am not sure how the high signal_cache count can be explained,
>> > and whether anything can be done at the Lustre level?
>>
>> signal_cache should have one entry for each process (or thread-group).
>> It holds the signal_struct structure that is shared among the threads
>> in a group.
>> So 3.7 million signal_structs suggests there are 3.7 million processes
>> on the system. I don't think Linux supports more than 4 million, so
>> that is one very busy system.
>> Unless... the final "put" of a task_struct happens via call_rcu - so it
>> can be delayed a while, normally 10s of milliseconds, but it can take
>> seconds to clear a large backlog.
>> So if you have lots of processes being created and destroyed very
>> quickly, then you might get a backlog of task_struct, and the associated
>> signal_struct, waiting to be destroyed.
>> However, if the task_struct slab were particularly big, I suspect you
>> would have included it in the list of large slabs - but you didn't.
>> If signal_cache has more active entries than task_struct, then something
>> has gone seriously wrong somewhere.
>>
>> I doubt this problem is related to lustre.
>>
>> NeilBrown

--
Jacek Tomaka
Geophysical Software Developer

DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
jac...@dug.com
www.dug.com
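P.S. For the record, a quick way to compare signal_cache against task_struct straight from /proc/slabinfo (field positions per the version 2.1 header quoted above: name, active_objs and num_objs are the first three columns):

[root@xxx xxx]# awk '$1 == "signal_cache" || $1 == "task_struct" {print $1, "active="$2, "total="$3}' /proc/slabinfo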
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org