Re: [SOLVED] Re: Strange behavior after running under high load
Konstantin Belousov writes:

> > B) We lack a nuanced call-back to tell the subsystems to release some
> > of their memory "without major delay".
>
> The delay in the wall clock sense does not drive the issue.

I didn't say anything about "wall clock" and you are missing my point by
a wide margin.

We need to make major memory consumers, like vnodes, take action *before*
shortages happen, so that *when* they happen, a lot of memory can be
released to relieve them.

> We cannot expect any io to proceed while we are low on memory [...]

Which is precisely why the top level goal should be for that to never
happen, while still allowing the "freeable" memory to be used as a cache
as much as possible.

> > C) We have never attempted to enlist userland, where jemalloc often
> > hangs on to a lot of unused VM pages.
>
> The userland does not add to this problem, [...]

No, but userland can help solve it: The unused pages from
jemalloc/userland can very quickly be released to relieve any imminent
shortage the kernel might have.  As can pages from vnodes, and for that
matter socket buffers.

But there are always costs: actual costs, i.e. what it will take to
release the memory (locking, VM mappings, washing), and potential costs
(lack of future caching opportunities).  These costs need to be presented
to the central memory allocator, so when it decides back-pressure is
appropriate, it can decide who to punk for how much memory.

> But normally operating system does not have an issue with user pages.

Only if you disregard all non-UNIX operating systems.  Many other kernels
have cooperated with userland to balance memory (and for that matter
disk-space).

Just imagine how much better the desktop experience would be, if we could
send SIGVM to firefox to tell it to stop being a memory-pig.  (At least
two of the major operating systems in the desktop world do something like
that today.)

> Io latency is not the factor there.  We must avoid situations where
> instantiating a vnode stalls waiting for KVA to appear, similarly we
> must avoid system state where vnodes allocation consumed so much kmem
> that other allocations stall.

My argument is the precise opposite: We must make vnodes, and the
allocations they cause, responsive to the system's overall memory
availability, well in advance of the shortage happening in the first
place.

> Quite indicative is that we do not shrink the vnode list on low memory
> events.  Vnlru also does not account for the memory pressure.

The only reason we do not, is that we cannot tell definitively if freeing
a vnode will cause disk-I/O (which may not matter with SSDs), or even how
much memory it might free, if anything.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
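[Editor's note: the cost-reporting idea above -- subsystems advertising how much memory they can release "without major delay" and at what cost, with a central allocator deciding who gives up how much -- could be sketched roughly as below. This is purely illustrative; none of these names or structures exist in FreeBSD, and a real implementation would keep donors in the NUMA-aware summarizing tree the thread mentions rather than scanning a flat array.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: each subsystem (vnodes, jemalloc via upcall,
 * socket buffers, ...) registers a "donor" describing freeable memory
 * and the relative cost of freeing it.
 */
struct mem_donor {
	const char *name;
	size_t freeable;   /* bytes releasable "without major delay" */
	int cost;          /* relative release cost (locking, washing, ...) */
	size_t (*release)(struct mem_donor *, size_t want); /* bytes freed */
};

/* Drain cheapest donors first until 'want' bytes are freed or we run out. */
static size_t
mem_reclaim(struct mem_donor **donors, int ndonors, size_t want)
{
	size_t freed = 0;

	for (int pass = 0; pass < ndonors && freed < want; pass++) {
		struct mem_donor *best = NULL;
		for (int i = 0; i < ndonors; i++) {
			struct mem_donor *d = donors[i];
			if (d->freeable == 0)
				continue;
			if (best == NULL || d->cost < best->cost)
				best = d;
		}
		if (best == NULL)
			break;
		freed += best->release(best, want - freed);
	}
	return (freed);
}

/* Stand-in release callback for illustration: just gives up its pages. */
static size_t
fake_release(struct mem_donor *d, size_t want)
{
	size_t n = d->freeable < want ? d->freeable : want;
	d->freeable -= n;
	return (n);
}
```

The point of the cost field is exactly the trade-off described above: cheap donors (unused jemalloc pages) get asked first, expensive ones (vnodes that may need disk-I/O to free) last.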
Re: [SOLVED] Re: Strange behavior after running under high load
On Sun, Apr 04, 2021 at 07:01:44PM +, Poul-Henning Kamp wrote:
> Konstantin Belousov writes:
> > But what would you provide as the input for PID controller, and what
> > would be the targets?
>
> Viewing this purely as a vnode related issue is wrong, this is about
> memory allocation in general.
>
> We may or may not want a PID regulator, but putting it on counts of
> vnode would not improve things, precisely, as you point out, because
> the amount of memory a vnode ties up has enormous variance.
Yes

> We should focus on the end goal: To ensure "sufficient" memory can
> always be allocated for any purpose "without major delay".
and no

> Architecturally there are three major problems:
>
> A) While each subsystem generally has a good idea about memory that
> can be released "without major delay", the information does not
> trickle up through a summarizing NUMA aware tree.
>
> B) We lack a nuanced call-back to tell the subsystems to release some
> of their memory "without major delay".
The delay in the wall clock sense does not drive the issue.  We cannot
expect any io to proceed while we are low on memory, in the sense that
allocators cannot respond right now.  More and more, our io subsystem
requires allocating memory to make any progress with io.  This is
already quite bad with geom, although some hacks make it not too
outstanding.  It is very bad with ZFS, where swap on zvols causes
deadlocks almost immediately.

> C) We have never attempted to enlist userland, where jemalloc often
> hangs on to a lot of unused VM pages.
The userland does not add to this problem, because pagedaemon typically
has enough processing power to convert user-allocated pages into usable
clean or free pages.  Of course, if there is no swap and dirty anon
pages cannot be laundered, the issue would accumulate.  But normally
operating system does not have an issue with user pages.

> As far as vnodes go:
>
> It used to be that "without major delay" meant "without disk-I/O",
> which again led to the "dirty buffers/VM pages" heuristic.
>
> With microsecond SSD backing store, that heuristic is not only
> invalid, it is down-right harmful in many cases.
>
> GEOM maintains estimates of per-provider latency and VM+VFS should use
> that to schedule write-back so that more of it happens outside
> rush-hour, in order to increase the amount of memory which can be
> released "without major delay".
>
> Today that happens largely as a side effect of the periodic syncer,
> which does a really bad job at it, because it still expects VAX-era
> hardware performance and workloads.
Io latency is not the factor there.  We must avoid situations where
instantiating a vnode stalls waiting for KVA to appear, similarly we
must avoid system state where vnodes allocation consumed so much kmem
that other allocations stall.

Quite indicative is that we do not shrink the vnode list on low memory
events.  Vnlru also does not account for the memory pressure.  Problem
is that it is not clear how to express the relations between safe
allocators state and our desire to cache file system data, which is
bound to the vnode identity.
Re: [SOLVED] Re: Strange behavior after running under high load
Konstantin Belousov writes:

> But what would you provide as the input for PID controller, and what
> would be the targets?

Viewing this purely as a vnode related issue is wrong, this is about
memory allocation in general.

We may or may not want a PID regulator, but putting it on counts of
vnode would not improve things, precisely, as you point out, because the
amount of memory a vnode ties up has enormous variance.

We should focus on the end goal: To ensure "sufficient" memory can
always be allocated for any purpose "without major delay".

Architecturally there are three major problems:

A) While each subsystem generally has a good idea about memory that can
be released "without major delay", the information does not trickle up
through a summarizing NUMA aware tree.

B) We lack a nuanced call-back to tell the subsystems to release some of
their memory "without major delay".

C) We have never attempted to enlist userland, where jemalloc often
hangs on to a lot of unused VM pages.

As far as vnodes go:

It used to be that "without major delay" meant "without disk-I/O", which
again led to the "dirty buffers/VM pages" heuristic.

With microsecond SSD backing store, that heuristic is not only invalid,
it is down-right harmful in many cases.

GEOM maintains estimates of per-provider latency and VM+VFS should use
that to schedule write-back so that more of it happens outside rush-hour,
in order to increase the amount of memory which can be released "without
major delay".

Today that happens largely as a side effect of the periodic syncer,
which does a really bad job at it, because it still expects VAX-era
hardware performance and workloads.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
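[Editor's note: the latency-driven write-back scheduling proposed above -- use GEOM's per-provider latency estimates to push write-back outside "rush-hour" -- might look roughly like this. The structure and function names are invented for illustration; GEOM's real statistics live in devstat, and this sketch only models the decision with a simple smoothed-latency estimate.]

```c
#include <assert.h>

/* Hypothetical per-provider latency tracking (GEOM keeps real stats). */
struct provider_stats {
	double ewma_latency_us;    /* smoothed per-request latency */
};

/* Fold one observed request latency into a simple moving average. */
static void
provider_observe(struct provider_stats *ps, double latency_us)
{
	const double alpha = 0.2;  /* smoothing factor, chosen arbitrarily */

	if (ps->ewma_latency_us == 0.0)
		ps->ewma_latency_us = latency_us;
	else
		ps->ewma_latency_us = alpha * latency_us +
		    (1.0 - alpha) * ps->ewma_latency_us;
}

/*
 * Issue background write-back only when the provider is outside
 * "rush-hour", i.e. its smoothed latency is below the threshold.
 * Urgent write-back (actual memory shortage) proceeds regardless.
 */
static int
writeback_now(const struct provider_stats *ps, double busy_threshold_us,
    int urgent)
{
	return (urgent || ps->ewma_latency_us < busy_threshold_us);
}
```

The idea, per the message above, is that with microsecond SSDs the old "avoid disk-I/O" heuristic is replaced by "avoid *busy* providers": cheap write-back during quiet periods maximizes memory releasable without major delay.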
Re: [SOLVED] Re: Strange behavior after running under high load
On Sun, Apr 04, 2021 at 08:45:41AM -0600, Warner Losh wrote:
> On Sun, Apr 4, 2021, 5:51 AM Mateusz Guzik wrote:
> > On 4/3/21, Poul-Henning Kamp wrote:
> > > Mateusz Guzik writes:
> > >
> > > > It is high because of this:
> > > >     msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
> > > >
> > > > i.e. it literally sleeps for 1 second.
> > >
> > > Before the line looked like that, it slept on "lbolt" aka
> > > "lightning bolt" which was woken once a second.
> > >
> > > The calculations which come up with those "constants" have always
> > > been utterly bogus math, not quite "square-root of shoe-size times
> > > sun-angle in Patagonia", but close.
> > >
> > > The original heuristic came from university environments with tons
> > > of students doing assignments and nethack behind VT102 terminals,
> > > on filesystems where files only seldom grew past 100KB, so it made
> > > sense to scale number of vnodes to how much RAM was in the system,
> > > because that also scaled the size of the buffer-cache.
> > >
> > > With a merged VM buffer-cache, whatever validity that heuristic had
> > > was lost, and we tweaked the bogomath in various ways until it
> > > seemed to mostly work, trusting the users for which it did not, to
> > > tweak things themselves.
> > >
> > > Please don't tweak the Finagle Constants again.
> > >
> > > Rip all that crap out and come up with something fundamentally
> > > better.
> >
> > Some level of pacing is probably useful to control total memory use
> > -- there can be A LOT of memory tied up in the mere fact that a vnode
> > is fully cached.  imo the thing to do is to come up with some
> > watermarks to be revisited every 1-2 years and to change the behavior
> > when they get exceeded -- try to whack some stuff but in face of
> > trouble just go ahead and alloc without sleep 1.  Should the load
> > spike sort itself out, vnlru will slowly get things down to the
> > watermark.  If the watermark is too low, maybe it can autotune.
> > Bottom line is that even with the current idea of limiting preferred
> > total vnode count, the corner case behavior can be drastically
> > better, suffering SOME perf loss from recycling vnodes, but not
> > sleeping for a second for every single one.
>
> I'd suggest that going directly to a PID to control this would be
> better than the watermarks.  That would give a smoother response than
> high/low watermarks would.  While you'd need some level to keep things
> at still, the laundry stuff has shown the precise value of that level
> is less critical than the watermarks.

But what would you provide as the input for PID controller, and what
would be the targets?

The main reason for the (almost) hard cap on the number of vnodes is not
that an excessive number of vnodes is harmful by itself.  Each allocated
vnode typically implies the existence of several second-order
allocations that accumulate into significant KVA usage:
- filesystem inode
- vm object
- namecache entries

There are usually even more allocations, third-order; for instance, the
UFS inode carries a pointer to the dinode copy in RAM, and possibly an
EA area.  And of course the vnode names pages in the page cache owned by
the corresponding file, i.e. the amount of allocated vnodes regulates
the amount of work for pagedaemon.

We are currently trying to put some rational limit on the total number
of vnodes, estimating both KVA and physical memory consumed by them.  If
you remove that limit, you need to ensure that we do not create an OOM
situation either for KVA or for physical memory just by creating too
many vnodes, otherwise the system cannot get out of it.

So there are some combinations of machine config (RAM) and loads where
the default settings are arguably low.  Raising the limits needs to
handle the indirect resource usage from vnodes.

I do not know how to write the feedback formula, taking into account all
the consequences of the vnode existence, and that the effects also
depend on the underlying filesystem and the patterns of VM paging usage.
In this sense ZFS is probably the simplest case, because its caching
subsystem is autonomous, while UFS or NFS are tightly integrated with
VM.

> Warner
>
> > I think the notion of 'struct vnode' being a separately allocated
> > object is not very useful and it comes with complexity (and happens
> > to suffer from several bugs).
> >
> > That said, the easiest and safest thing to do in the meantime is to
> > bump the limit.  Perhaps the sleep can be whacked as it is which
> > would largely sort it out.
> >
> > --
> > Mateusz Guzik
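[Editor's note: for readers unfamiliar with the PID regulator Warner proposes and Konstantin questions, a minimal textbook PID step looks like the sketch below. It is generic and illustrative only: the open question in the thread is precisely what the input and target should be. Here we *assume*, hypothetically, input = some estimate of memory tied up by vnodes and target = a desired level, with the output read as "how many vnodes vnlru should try to recycle this period".]

```c
#include <assert.h>

/* Generic PID controller state; gains would need tuning per workload. */
struct pid {
	double kp, ki, kd;       /* proportional, integral, derivative gains */
	double integ, prev_err;  /* accumulated state between steps */
};

/*
 * One control step. 'measured' and 'target' are in the same unit
 * (e.g. bytes of vnode-related KVA); a positive return value means
 * "reclaim this much", a non-positive one means "do nothing".
 */
static double
pid_step(struct pid *p, double target, double measured, double dt)
{
	double err = measured - target;  /* positive when over target */
	double deriv;

	p->integ += err * dt;
	deriv = (err - p->prev_err) / dt;
	p->prev_err = err;
	return (p->kp * err + p->ki * p->integ + p->kd * deriv);
}
```

Compared with high/low watermarks, the controller reacts proportionally to how far the system is from the target, which is the "smoother response" Warner refers to; Konstantin's objection stands, though, since the variance in per-vnode memory makes a good "measured" signal hard to define.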
Re: [SOLVED] Re: Strange behavior after running under high load
On Sun, Apr 4, 2021, 5:51 AM Mateusz Guzik wrote:
> On 4/3/21, Poul-Henning Kamp wrote:
> >
> > Mateusz Guzik writes:
> >
> > > It is high because of this:
> > >     msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
> > >
> > > i.e. it literally sleeps for 1 second.
> >
> > Before the line looked like that, it slept on "lbolt" aka "lightning
> > bolt" which was woken once a second.
> >
> > The calculations which come up with those "constants" have always
> > been utterly bogus math, not quite "square-root of shoe-size times
> > sun-angle in Patagonia", but close.
> >
> > The original heuristic came from university environments with tons of
> > students doing assignments and nethack behind VT102 terminals, on
> > filesystems where files only seldom grew past 100KB, so it made sense
> > to scale number of vnodes to how much RAM was in the system, because
> > that also scaled the size of the buffer-cache.
> >
> > With a merged VM buffer-cache, whatever validity that heuristic had
> > was lost, and we tweaked the bogomath in various ways until it seemed
> > to mostly work, trusting the users for which it did not, to tweak
> > things themselves.
> >
> > Please don't tweak the Finagle Constants again.
> >
> > Rip all that crap out and come up with something fundamentally
> > better.
>
> Some level of pacing is probably useful to control total memory use --
> there can be A LOT of memory tied up in the mere fact that a vnode is
> fully cached.  imo the thing to do is to come up with some watermarks
> to be revisited every 1-2 years and to change the behavior when they
> get exceeded -- try to whack some stuff but in face of trouble just go
> ahead and alloc without sleep 1.  Should the load spike sort itself
> out, vnlru will slowly get things down to the watermark.  If the
> watermark is too low, maybe it can autotune.  Bottom line is that even
> with the current idea of limiting preferred total vnode count, the
> corner case behavior can be drastically better, suffering SOME perf
> loss from recycling vnodes, but not sleeping for a second for every
> single one.

I'd suggest that going directly to a PID to control this would be better
than the watermarks.  That would give a smoother response than high/low
watermarks would.  While you'd need some level to keep things at still,
the laundry stuff has shown the precise value of that level is less
critical than the watermarks.

Warner

> I think the notion of 'struct vnode' being a separately allocated
> object is not very useful and it comes with complexity (and happens to
> suffer from several bugs).
>
> That said, the easiest and safest thing to do in the meantime is to
> bump the limit.  Perhaps the sleep can be whacked as it is which would
> largely sort it out.
>
> --
> Mateusz Guzik
Re: [SOLVED] Re: Strange behavior after running under high load
On 4/3/21, Poul-Henning Kamp wrote:
>
> Mateusz Guzik writes:
>
> > It is high because of this:
> >     msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
> >
> > i.e. it literally sleeps for 1 second.
>
> Before the line looked like that, it slept on "lbolt" aka "lightning
> bolt" which was woken once a second.
>
> The calculations which come up with those "constants" have always been
> utterly bogus math, not quite "square-root of shoe-size times
> sun-angle in Patagonia", but close.
>
> The original heuristic came from university environments with tons of
> students doing assignments and nethack behind VT102 terminals, on
> filesystems where files only seldom grew past 100KB, so it made sense
> to scale number of vnodes to how much RAM was in the system, because
> that also scaled the size of the buffer-cache.
>
> With a merged VM buffer-cache, whatever validity that heuristic had
> was lost, and we tweaked the bogomath in various ways until it seemed
> to mostly work, trusting the users for which it did not, to tweak
> things themselves.
>
> Please don't tweak the Finagle Constants again.
>
> Rip all that crap out and come up with something fundamentally better.

Some level of pacing is probably useful to control total memory use --
there can be A LOT of memory tied up in the mere fact that a vnode is
fully cached.  imo the thing to do is to come up with some watermarks to
be revisited every 1-2 years and to change the behavior when they get
exceeded -- try to whack some stuff but in face of trouble just go ahead
and alloc without sleep 1.  Should the load spike sort itself out, vnlru
will slowly get things down to the watermark.  If the watermark is too
low, maybe it can autotune.

Bottom line is that even with the current idea of limiting preferred
total vnode count, the corner case behavior can be drastically better,
suffering SOME perf loss from recycling vnodes, but not sleeping for a
second for every single one.

I think the notion of 'struct vnode' being a separately allocated object
is not very useful and it comes with complexity (and happens to suffer
from several bugs).

That said, the easiest and safest thing to do in the meantime is to bump
the limit.  Perhaps the sleep can be whacked as it is which would
largely sort it out.

--
Mateusz Guzik
Re: [SOLVED] Re: Strange behavior after running under high load
Mateusz Guzik writes:

> It is high because of this:
>     msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
>
> i.e. it literally sleeps for 1 second.

Before the line looked like that, it slept on "lbolt" aka "lightning
bolt" which was woken once a second.

The calculations which come up with those "constants" have always been
utterly bogus math, not quite "square-root of shoe-size times sun-angle
in Patagonia", but close.

The original heuristic came from university environments with tons of
students doing assignments and nethack behind VT102 terminals, on
filesystems where files only seldom grew past 100KB, so it made sense to
scale number of vnodes to how much RAM was in the system, because that
also scaled the size of the buffer-cache.

With a merged VM buffer-cache, whatever validity that heuristic had was
lost, and we tweaked the bogomath in various ways until it seemed to
mostly work, trusting the users for which it did not, to tweak things
themselves.

Please don't tweak the Finagle Constants again.

Rip all that crap out and come up with something fundamentally better.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org        | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: [SOLVED] Re: Strange behavior after running under high load
On 4/2/21, Stefan Esser wrote:
> On 28.03.21 at 16:39, Stefan Esser wrote:
> > After a period of high load, my now idle system needs 4 to 10 seconds
> > to run any trivial command - even after 20 minutes of no load ...
> >
> > I have run some Monte-Carlo simulations for a few hours, with
> > initially 35 processes running in parallel for some 10 seconds each.
> >
> > The load decreased over time since some parameter sets were faster to
> > process.  All in all 63000 processes ran within some 3 hours.
> >
> > When the system became idle, interactive performance was very bad.
> > Running any trivial command (e.g. uptime) takes some 5 to 10 seconds.
> > Since I have to have this system working, I plan to reboot it later
> > today, but will keep it in this state for some more time to see
> > whether this state persists or whether the system recovers from it.
> >
> > Any ideas what might cause such a system state???
>
> Seems that Mateusz Guzik was right to mention performance issues when
> the system is very low on vnodes. (Thanks!)
>
> I have been able to reproduce the issue and have checked vnode stats:
>
> kern.maxvnodes: 620370
> kern.minvnodes: 155092
> vm.stats.vm.v_vnodepgsout: 6890171
> vm.stats.vm.v_vnodepgsin: 18475530
> vm.stats.vm.v_vnodeout: 228516
> vm.stats.vm.v_vnodein: 1592444
> vfs.wantfreevnodes: 155092
> vfs.freevnodes: 47          <- obviously too low ...
> vfs.vnodes_created: 19554702
> vfs.numvnodes: 621284
> vfs.cache.debug.vnodes_cel_3_failures: 0
> vfs.cache.stats.heldvnodes: 6412
>
> The freevnodes value stayed in this region over several minutes, with
> typical program start times (e.g. for "uptime") in the region of 10 to
> 15 seconds.
>
> After raising maxvnodes to 2,000,000 from 600,000 the system
> performance is restored and I get:
>
> kern.maxvnodes: 2000000
> kern.minvnodes: 500000
> vm.stats.vm.v_vnodepgsout: 7875198
> vm.stats.vm.v_vnodepgsin: 20788679
> vm.stats.vm.v_vnodeout: 261179
> vm.stats.vm.v_vnodein: 1817599
> vfs.wantfreevnodes: 500000
> vfs.freevnodes: 205988      <- still a lot higher than wantfreevnodes
> vfs.vnodes_created: 19956502
> vfs.numvnodes: 912880
> vfs.cache.debug.vnodes_cel_3_failures: 0
> vfs.cache.stats.heldvnodes: 20702
>
> I do not know why the performance impact is so high - there are a few
> free vnodes (more than required for the shared libraries to start e.g.
> the uptime program).  Most probably each attempt to get a vnode
> triggers a clean-up attempt that runs for a significant time, but has
> no chance to actually reach near the goal of 155k or 500k free vnodes.

It is high because of this:

    msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);

i.e. it literally sleeps for 1 second.

The vnode limit is probably too conservative and the behavior when the
limit is reached is rather broken.  Probably the thing to do is to let
allocations go through while kicking vnlru to free some stuff up.  I'll
have to sleep on it.

> Anyway, kern.maxvnodes can be changed at run-time and it is thus easy
> to fix.  It seems that no message is logged to report this situation.
> A rate limited hint to raise the limit should help other affected
> users.
>
> Regards, Stefan

--
Mateusz Guzik
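[Editor's note: the alternative Mateusz sketches above -- let the allocation proceed past the preferred limit while kicking vnlru, instead of sleeping a second per vnode -- can be modeled in miniature as below. Everything here is hypothetical: the struct, the limits, and the counter standing in for a `wakeup(&vnlruproc)` are invented to illustrate the control flow, not actual FreeBSD code.]

```c
#include <assert.h>

/* Toy model of vnode allocation pacing; not real kernel state. */
struct vnode_pacing {
	long numvnodes;    /* current vnode count */
	long softlimit;    /* preferred cap (the "watermark") */
	long hardlimit;    /* absolute cap, to avoid KVA/RAM OOM */
	int  vnlru_kicks;  /* times we asked vnlru to reclaim */
};

/*
 * Instead of msleep(..., "vlruwk", hz) once above the soft limit,
 * wake vnlru and let the allocation go through; only refuse outright
 * at the hard limit.  Returns 0 on success, -1 if truly out of room.
 */
static int
vn_alloc_paced(struct vnode_pacing *vp)
{
	if (vp->numvnodes >= vp->hardlimit)
		return (-1);
	if (vp->numvnodes >= vp->softlimit)
		vp->vnlru_kicks++;    /* stand-in for waking the vnlru thread */
	vp->numvnodes++;
	return (0);
}
```

Under a load spike this trades a bounded overshoot of the soft limit (and some recycling cost once vnlru catches up) for never stalling a single allocation by a full second, which is the "drastically better corner case behavior" argued for earlier in the thread.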
[SOLVED] Re: Strange behavior after running under high load
On 28.03.21 at 16:39, Stefan Esser wrote:

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...

I have run some Monte-Carlo simulations for a few hours, with initially
35 processes running in parallel for some 10 seconds each.

The load decreased over time since some parameter sets were faster to
process.  All in all 63000 processes ran within some 3 hours.

When the system became idle, interactive performance was very bad.
Running any trivial command (e.g. uptime) takes some 5 to 10 seconds.
Since I have to have this system working, I plan to reboot it later
today, but will keep it in this state for some more time to see whether
this state persists or whether the system recovers from it.

Any ideas what might cause such a system state???

Seems that Mateusz Guzik was right to mention performance issues when
the system is very low on vnodes. (Thanks!)

I have been able to reproduce the issue and have checked vnode stats:

kern.maxvnodes: 620370
kern.minvnodes: 155092
vm.stats.vm.v_vnodepgsout: 6890171
vm.stats.vm.v_vnodepgsin: 18475530
vm.stats.vm.v_vnodeout: 228516
vm.stats.vm.v_vnodein: 1592444
vfs.wantfreevnodes: 155092
vfs.freevnodes: 47          <- obviously too low ...
vfs.vnodes_created: 19554702
vfs.numvnodes: 621284
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 6412

The freevnodes value stayed in this region over several minutes, with
typical program start times (e.g. for "uptime") in the region of 10 to
15 seconds.

After raising maxvnodes to 2,000,000 from 600,000 the system performance
is restored and I get:

kern.maxvnodes: 2000000
kern.minvnodes: 500000
vm.stats.vm.v_vnodepgsout: 7875198
vm.stats.vm.v_vnodepgsin: 20788679
vm.stats.vm.v_vnodeout: 261179
vm.stats.vm.v_vnodein: 1817599
vfs.wantfreevnodes: 500000
vfs.freevnodes: 205988      <- still a lot higher than wantfreevnodes
vfs.vnodes_created: 19956502
vfs.numvnodes: 912880
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 20702

I do not know why the performance impact is so high - there are a few
free vnodes (more than required for the shared libraries to start e.g.
the uptime program).  Most probably each attempt to get a vnode triggers
a clean-up attempt that runs for a significant time, but has no chance
to actually reach near the goal of 155k or 500k free vnodes.

Anyway, kern.maxvnodes can be changed at run-time and it is thus easy to
fix.  It seems that no message is logged to report this situation.
A rate limited hint to raise the limit should help other affected users.

Regards, Stefan
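[Editor's note: the rate-limited hint Stefan asks for -- log at most one "consider raising kern.maxvnodes" message per interval when the limit is hit -- is a standard pattern; the kernel has ppsratecheck(9) for this purpose. A minimal standalone sketch of the gating logic, with invented names, is below.]

```c
#include <assert.h>

/* Minimal once-per-interval gate, in the spirit of ppsratecheck(9). */
struct ratelimit {
	long last_time;   /* seconds of last emitted message */
	long interval;    /* minimum seconds between messages */
};

/* Returns 1 if the caller may log now, 0 if the message is suppressed. */
static int
ratelimit_ok(struct ratelimit *rl, long now)
{
	if (now - rl->last_time < rl->interval)
		return (0);
	rl->last_time = now;
	return (1);
}
```

A caller on the vnode-allocation slow path would then do something like `if (ratelimit_ok(&rl, now)) log("vnode limit reached, consider raising kern.maxvnodes");` so a thrashing system emits the hint once a minute rather than thousands of times.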