On Fri, Apr 15, 2016 at 01:31:55PM +0200, Paolo Bonzini wrote:
> [this time including the mailing list]
> 
> This is yet another tiny bit of the multiqueue work, this time affecting
> the synchronization infrastructure for coroutines. Currently, coroutines
> synchronize between the main I/O thread and the dataplane iothread through
> the AioContext lock. However, for multiqueue a single BDS will be used
> by multiple iothreads and hence multiple AioContexts. This calls for
> a different approach to coroutine synchronization. This series is my
> first attempt at it. Besides multiqueue, I think throttling groups
> (which can already span multiple AioContexts) could also take advantage
> of the new CoMutexes.
> 
> The series has two main parts:
> 
> - it makes CoMutex bind coroutines to an AioContext. It is of course
>   still possible to move coroutines around with explicit yield and
>   qemu_coroutine_enter, but a CoMutex will now reenter the coroutine
>   where it was running before, rather than in the AioContext of
>   the previous owner. To do this, a new function aio_co_schedule is
>   introduced to run a coroutine on a given iothread. I think this could
>   be used in other places too; for now it lets a CoMutex protect stuff
>   across multiple AioContexts without them moving around(*). Of course
>   currently a CoMutex is generally not used across multiple iothreads,
>   because you have to acquire/release the AioContext around CoMutex
>   critical sections. However...
> 
> - ... the second change is exactly to make CoMutex thread-safe and remove
>   the need for a "thread-based" mutex around it. The new CoMutex is
>   exactly the same as a mutex implementation that you'd find in an
>   operating system. iothreads are the moral equivalent of CPUs in
>   a kernel, while coroutines resemble kernel threads running without
>   preemption on a CPU. Even if you have to take concurrency into account,
>   the lack of preemption while running coroutines or bottom halves
>   keeps the complexity at bay.
> For example, it is easy to synchronize between qemu_co_mutex_lock's
> yield and the qemu_coroutine_enter in aio_co_schedule's bottom half.
> 
> Same as before, CoMutex puts coroutines to sleep with
> qemu_coroutine_yield and wakes them up with the new aio_co_schedule.
> I could have wrapped CoMutex's CoQueue with a "regular" thread mutex or
> spinlock. The resulting code would have looked a lot like RFifoLock
> (with CoQueue replacing RFifoLock's condition variable). Instead,
> inspired by the parallel between coroutines and non-preemptive kernel
> threads, I chose to adopt the same lock-free mutex algorithm as OSv.
> The algorithm only needs two to four atomic ops for a lock-unlock pair
> (two when uncontended). To cover CoQueue, each CoQueue is made to
> depend on a CoMutex, similar to condition variables. Most CoQueues
> already have a corresponding CoMutex, so this is not a big deal;
> converting the others is left for a future series. I did this because
> CoQueue looks a lot like OSv's waitqueues; so if necessary, we could
> even take OSv's support for wait morphing (which avoids the thundering
> herd problem) and add it to CoMutex and CoQueue. This may be useful
> when making tracked_requests thread-safe.
> 
> Kevin: this has nothing to do with my old plan from Brno, and it's
> possibly a lot closer to what you wanted. Your idea was to automatically
> release the "thread mutex" when a coroutine yields; I think you should
> be fine with not having a thread mutex at all!
> 
> This will need some changes in the formats because, for multiqueue,
> CoMutexes would need to be used like "real" thread mutexes. Code like
> this:
> 
>     ...
>     qemu_co_mutex_unlock()
>     ... /* still access shared data, but don't yield */
>     qemu_coroutine_yield()
> 
> might be required to use this other pattern:
> 
>     ...
>     /* access shared data, but don't yield */
>     qemu_co_mutex_unlock()
>     qemu_coroutine_yield()
> 
> because "adding a second CPU" is already introducing concurrency that
> wasn't there before. The "non-preemptive multitasking" reference only
> applies to things that access AioContext-local data. This includes the
> synchronization primitives implemented in this series, the thread pool,
> and the Linux AIO support, but not much else. It still simplifies
> _those_, though. :)
> 
> Anyhow, we'll always have some BlockDrivers that do not support
> multiqueue, such as the network protocols. Thus it would be possible
> to handle the formats one at a time. raw-posix, raw and qcow2 would
> already form a pretty good set, and the first two do not use CoMutex
> at all.
> 
> The series has quite a lot of new code, but about half of it is
> testcases. The new test covers correctness and (when run with --verbose)
> also takes a stab at measuring performance; the result is that
> performance of CoMutex is comparable to pthread mutexes. The only snag
> is that you cannot make a direct comparison between CoMutex (fair) and
> pthread_mutex_t (unfair). For this reason the testcase also measures
> performance of a quick-and-dirty implementation of a fair mutex, based
> on MCS locks + futexes.
> 
> There's a lot of meat in the above text, and I hope it will make the code
> clearer and compensate for the terse commit messages. :) I'll probably
> write a single-threaded testcase too, just to provide some more unit
> test comparison of "before" and "after".
> 
> I haven't even started a guest with these patches, let alone run
> qemu-iotests... generally the changes are well confined to unit-tested
> code; patch 2, for example, is completely untested. There are a couple
> of other places that at least need more comments, but I wanted to throw
> the patches out for an early review of the general approach.
>
> Paolo Bonzini (11):
>   coroutine: use QSIMPLEQ instead of QTAILQ
>   throttle-groups: restart throttled requests from coroutine context
>   coroutine: delete qemu_co_enter_next
>   aio: introduce aio_co_schedule
>   coroutine-lock: reschedule coroutine on the AioContext it was running on
>   coroutine-lock: make CoMutex thread-safe
>   coroutine-lock: add limited spinning to CoMutex
>   test-aio-multithread: add performance comparison with thread-based mutexes
>   coroutine-lock: place CoMutex before CoQueue in header
>   coroutine-lock: add mutex argument to CoQueue APIs
>   coroutine-lock: make CoRwlock thread-safe and fair
>
>  async.c                      |  38 ++++
>  block/backup.c               |   2 +-
>  block/io.c                   |   2 +-
>  block/qcow2-cluster.c        |   4 +-
>  block/sheepdog.c             |   2 +-
>  block/throttle-groups.c      |  71 +++++--
>  hw/9pfs/9p.c                 |   2 +-
>  include/block/aio.h          |  14 ++
>  include/qemu/atomic.h        |   3 +
>  include/qemu/coroutine.h     | 106 ++++++----
>  include/qemu/coroutine_int.h |  14 +-
>  iothread.c                   |  16 ++
>  stubs/iothread.c             |  11 ++
>  tests/Makefile               |  11 +-
>  tests/iothread.c             | 107 ++++++++++
>  tests/iothread.h             |  25 +++
>  tests/test-aio-multithread.c | 452 +++++++++++++++++++++++++++++++++++++++++++
>  trace-events                 |   8 +-
>  util/qemu-coroutine-lock.c   | 257 ++++++++++++++++++++----
>  util/qemu-coroutine.c        |   2 +-
>  20 files changed, 1037 insertions(+), 110 deletions(-)
>  create mode 100644 tests/iothread.c
>  create mode 100644 tests/iothread.h
>  create mode 100644 tests/test-aio-multithread.c
Looks promising, modulo questions and comments I posted on patches. I wonder whether it would be cleaner to introduce new primitives instead of modifying CoMutex/CoQueue. That way it would be clear whether code is written to be thread-safe or not.