Hi Jim, thank you for your interest! While FUTEX_SWAP seems to be a
nonstarter, there is a discussion off-list on how to approach the larger
problem of userspace scheduling. A full userspace scheduling patchset is
likely to take some time to take shape, but the "core" patches of
wait/wake/swap are more or less ready, so I'll probably post an early
RFC version here in the next week or two.
CC-ing the maintainers. Thanks, Peter

On Wed, Mar 17, 2021 at 10:59 AM Jim Newsome <[email protected]> wrote:
>
> I'm not well versed in this part of the kernel (ok, any part, really),
> but I wanted to chime in from a user perspective that I'm very
> interested in this functionality.
>
> We (Rob + Ryan + I, cc'd) are currently developing the second generation
> of the Shadow simulator <https://shadow.github.io/>, which is used by
> various researchers and the Tor Project. In this new architecture,
> simulated network-application processes (such as tor, browsers, and web
> servers) are each run as a native OS process, started by forking and
> exec'ing its unmodified binary. We are interested in supporting large
> simulations (e.g. 50k+ processes), and expect them to take on the order
> of hours or even days to execute, so scalability and performance matter.
>
> We've prototyped two mechanisms for controlling these simulated
> processes, and a third hybrid mechanism that combines the two. I've
> mentioned one of these (ptrace) in another thread ("do_wait: make
> PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
> an LD_PRELOAD'd shim that implements the libc interface, and
> communicates with Shadow via a syscall-like API over IPC.
>
> So far the most performant version we've tried of this IPC is with a bit
> of shared memory and a pair of semaphores. It looks much like the
> example in Peter's proposal:
>
> > a. T1: futex-wake T2, futex-wait
> > b. T2: wakes, does what it has been woken to do
> > c. T2: futex-wake T1, futex-wait
>
> We've been able to get the switching costs down using CPU pinning and
> SCHED_FIFO. Each physical CPU spends most of its time swapping back and
> forth between a Shadow worker thread and an emulated process. Even so,
> the new architecture is so far slower than the first generation of
> Shadow, which multiplexes the simulated processes into its own handful
> of OS processes (but is complex and fragile).
> > With FUTEX_SWAP, steps a and c above can be reduced to one futex
> > operation that runs 5-10 times faster.
>
> IIUC the proposed primitives could let us further improve performance,
> and perhaps drop some of the complexity of attempting to control the
> scheduler via pinning and SCHED_FIFO.

