Hello GHC devs,

While reading some recent material posted by Linus Torvalds ^1 ^2, I was reminded of GHC Trac #3553 ^3, which I stumbled upon some years ago.
Looking past the harsh tone Linus used in his notes, I think he makes some reasonable points about the problems that using spinlocks in userspace can cause. Specifically:

- A group of threads yielding while waiting for a spinlock is, in effect, just trying to coax the scheduler into running the thread that holds the lock. This is much easier for the scheduler to do correctly with a futex or similar, since it then has some context about which tasks are blocking and waking each other up (see the sketch at the end of this message for what the two approaches look like).

- Apparently the Linux scheduler (and perhaps other schedulers) has some smarts for preserving cache locality when choosing which physical execution units run which tasks, and reordering threads in the run list in an ad hoc way with sched_yield() interferes with this mechanism.

- Using sched_yield() (or otherwise randomly interfering with run lists) can cause disproportionate havoc on NUMA systems, where jostling the lock-holding thread around by burning time slices and yielding is especially costly.

All that said, there is perhaps one concrete advantage (aside from raw performance) to the current spinlock implementation: the only functionality it requires from the host OS is a yield-like operation. A yielding spinlock occupies a comfortable local optimum, achieving decent performance across platforms with minimal exposure to cross-OS API differences.

Revisiting #3553, it seems the only reason a futex was not used in place of a spinlock with sched_yield() was that, despite all of the points against it, the spinlock code empirically performed better. However, those tests were run years ago, and the futex implementation in the Linux kernel has changed quite a bit since then.

I think there is a healthy amount of empirical evidence from the GHC user community that something is amiss with the way the parallel GC does its locking, and I wonder whether the bad cases Linus describes are playing a role. Advice like "use -qg" or "use -qnM with small M" is often parroted (this Twitter thread ^4 contains both pieces of advice).

Furthermore, I would wager that an RTS that uses futexes for locking rather than spinning plays nicer with other CPU-intensive tasks on the same machine. The "scheduler storm" caused by a large number of threads waiting on a spinlock could have a negative impact on neighboring processes: a large number of spinning HECs certainly does a busy neighboring DB process no favors.

I'm curious whether others have thoughts on reviving the futex investigation. Perhaps the futexes provided by modern Linux are more competitive with the current spinlock implementation, or perhaps better scaling on machines with high core counts is worth some cost. And if futexes remain handily defeated by yielding spinlocks, it would still be a worthy endeavor to explain precisely _why_ that is; other programs have seen performance issues crop up when the kernel makes subtle changes to how sched_yield() works. ^5

1: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
2: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752
3: https://gitlab.haskell.org/ghc/ghc/issues/3553
4: https://twitter.com/ndm_haskell/status/1076187051610042368?s=20
5: https://lwn.net/Articles/31462/
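P.S. For concreteness, here is a minimal, Linux-only sketch contrasting the two strategies: a bounded spin followed by sched_yield() (roughly the shape of what the RTS does today, though the names, the spin count of 1000, and the toy harness are all made up for illustration) versus a futex-based lock in the style of Drepper's "Futexes Are Tricky" mutex. This is not a proposal for the actual RTS code, just a sketch of what each approach tells the kernel.

    /* futex_vs_yield.c -- build with: gcc -O2 -pthread futex_vs_yield.c */
    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* --- yield-based spinlock: the kernel never learns who we wait for --- */

    typedef struct { atomic_uint locked; } yield_lock;

    static void yield_lock_acquire(yield_lock *l)
    {
        for (;;) {
            /* bounded busy-wait before giving up the time slice */
            for (int i = 0; i < 1000; i++)
                if (atomic_load(&l->locked) == 0 &&
                    atomic_exchange(&l->locked, 1) == 0)
                    return;
            /* out of patience: yield and hope the scheduler happens
             * to pick the lock holder next */
            sched_yield();
        }
    }

    static void yield_lock_release(yield_lock *l)
    {
        atomic_store(&l->locked, 0);
    }

    /* --- futex-based lock: 0 = free, 1 = held, 2 = held with waiters --- */

    typedef struct { atomic_uint state; } futex_lock;

    static long sys_futex(atomic_uint *uaddr, int op, unsigned val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* compare-and-swap that returns the value it observed */
    static unsigned cas(atomic_uint *p, unsigned expected, unsigned desired)
    {
        atomic_compare_exchange_strong(p, &expected, desired);
        return expected;
    }

    static void futex_lock_acquire(futex_lock *l)
    {
        unsigned c = cas(&l->state, 0, 1);        /* uncontended fast path */
        if (c != 0) {
            do {
                /* mark the lock contended, then tell the kernel exactly
                 * which word we are blocked on and go to sleep */
                if (c == 2 || cas(&l->state, 1, 2) != 0)
                    sys_futex(&l->state, FUTEX_WAIT_PRIVATE, 2);
            } while ((c = cas(&l->state, 0, 2)) != 0);
        }
    }

    static void futex_lock_release(futex_lock *l)
    {
        if (atomic_fetch_sub(&l->state, 1) != 1) {  /* waiters present */
            atomic_store(&l->state, 0);
            sys_futex(&l->state, FUTEX_WAKE_PRIVATE, 1);
        }
    }

    /* --- tiny harness: 8 threads hammer a counter under each lock --- */

    static yield_lock yl;
    static futex_lock fl;
    static long counter;

    static void *worker(void *use_futex)
    {
        for (int i = 0; i < 100000; i++) {
            if (use_futex) { futex_lock_acquire(&fl); counter++; futex_lock_release(&fl); }
            else           { yield_lock_acquire(&yl); counter++; yield_lock_release(&yl); }
        }
        return NULL;
    }

    int main(void)
    {
        for (int use_futex = 0; use_futex <= 1; use_futex++) {
            pthread_t t[8];
            counter = 0;
            for (int i = 0; i < 8; i++)
                pthread_create(&t[i], NULL, worker, use_futex ? (void *)&fl : NULL);
            for (int i = 0; i < 8; i++)
                pthread_join(t[i], NULL);
            printf("%-12s counter = %ld\n",
                   use_futex ? "futex:" : "yield-spin:", counter);
        }
        return 0;
    }

The difference is entirely in the slow path: the yielding lock burns the rest of its slice and leaves the scheduler to guess which thread to run, while the futex version tells the kernel exactly which address it is blocked on, so the waiter sleeps until the holder's release wakes it.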