( I'm adding LKML here, because we are getting into "what can the kernel do to help user-space RCU implement priority inheritance cleanly". Sorry for cross-mailing-list posting, but it seems required as this thread is of interest to many groups. )
* Lai Jiangshan ([email protected]) wrote:
> On 08/17/2011 04:46 PM, Paolo Bonzini wrote:
> > On 08/16/2011 12:58 AM, Lai Jiangshan wrote:
> >> These series patches implelent a priority-boost urcu
> >> based on pi-lock.
> >>
> >> Some other locks(especial rcu_gp_lock) should be also
> >> priority-aware, these patches did touch them and make
> >> the patchset simpler.
> >
> > While really cool, I found this patchset overly complex.
> >
> > What we should introduce is abstractions over futexes. This is what
> > I did to experimentally port URCU to QEMU---my secret goal since
> > commit 806f811 (use kernel style makefile output, 2010-03-01). :)
> > Our use of futexes is exceptionally similar to a Windows
> > manual-reset event (yes, Windows:
> > http://msdn.microsoft.com/en-us/library/system.threading.manualresetevent%28v=vs.80%29.aspx).
> > In QEMU I added the manual-reset event and use it in the
> > implementation of RCU.
> >
> > By introducing an abstraction for this, we can make the code a lot
> > clearer and secondarily gain in portability. For QEMU portability
> > was actually my primary goal, but URCU might have different
> > priorities. :)
> >
> > PI futex support can also be implemented in the same framework.
>
> How?
>
> Challenges of userspace priority-boost urcu.

Ah! :-) This is the kind of discussion I really want us to have before we choose which PI approach we should follow for Userspace RCU.

> No matter how to design a urcu, update site have to wait for the
> started read site. Normal waiting pattern is:
>
> -----------------------------------
> thread1                 thread2 (one of read site)
> ...                     ...
> xx_wait(&something);    xx_wake(&something);
> ...                     ...
> ------------------------------------
>
> Even thread1 is a higher priority thread, thread2 will not be boosted,
> because the OS does not know which thread will do "wake(&something);"
>
> Three approaches can achieve it in my mind.
> 1) tell the OS which thread need to be boosted when waiting.

The good side of approach (1) is that the Userspace RCU grace period can iterate on all the registered reader threads, keeping some of them in the list of readers to wait for, and moving the others to the list of threads where a grace period change has been observed. If there are still readers in the "to be waited for" list after one of these iterations, the grace period needs to retry, in an adaptive way: busy-looping for a few loops first, then switching to wait/wakeup.

It would be good to let the grace period identify _all_ the threads it will be waiting for (it already knows them by iterating on the liburcu registry of registered threads), so that when it calls futex wait, the kernel would know the full list of threads that should inherit the priority of the waiter. The "wake" side of this futex could then be special: we could require that _all_ identified threads call wakeup before the waiter is woken up. This makes sense here, because it is pretty much useless to walk the reader registry again while there is still at least one reader left to wait for.

What I describe here could very well already exist, or we might have to extend the futex API to do it. My knowledge of futex PI can certainly gain from being extended.
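To make this more concrete, here is a minimal sketch of what the grace-period side could look like under approach (1). This is not liburcu code: gp_futex, the scan_readers_fn callback and GP_SPIN_ATTEMPTS are stand-ins for the reader registry walk described above, and futex_wait_boosting() is a purely hypothetical wrapper for the futex extension I am talking about (today it can only fall back to a plain FUTEX_WAIT, since the current futex ABI has no way to pass the list of threads to boost):

/*
 * Minimal sketch only -- this is not existing liburcu code, and the
 * futex_wait_boosting() wrapper below is hypothetical: it stands in
 * for the futex extension discussed above, where the waiter would hand
 * the kernel the TIDs of every reader it is still waiting for so that
 * all of them can inherit the waiter's priority in parallel.  No such
 * operation exists in the current futex ABI, so the wrapper falls back
 * to a plain FUTEX_WAIT.
 */
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define GP_SPIN_ATTEMPTS	100	/* arbitrary busy-loop budget */

static int32_t gp_futex;		/* assumed grace-period futex word */

/* Hypothetical API: today this is just FUTEX_WAIT; the proposed
 * extension would also register reader_tids[0..nr-1] for PI boosting. */
static long futex_wait_boosting(int32_t *uaddr, int32_t val,
				const pid_t *reader_tids, size_t nr)
{
	(void)reader_tids;
	(void)nr;			/* unused until such an ABI exists */
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

/*
 * Assumed callback: walks the reader registry, prunes readers that
 * already observed the new grace period, fills tids[] with the TIDs of
 * the remaining ones and returns how many are left.
 */
typedef size_t (*scan_readers_fn)(pid_t *tids, size_t max);

static void wait_for_readers(scan_readers_fn scan, pid_t *tids, size_t max)
{
	size_t remaining;

	while ((remaining = scan(tids, max)) != 0) {
		int spins;

		/* Adaptive wait: busy-loop over the registry first... */
		for (spins = 0; spins < GP_SPIN_ATTEMPTS; spins++) {
			if ((remaining = scan(tids, max)) == 0)
				return;
		}

		/* ...then prepare to block: publish that we will sleep. */
		__atomic_store_n(&gp_futex, -1, __ATOMIC_SEQ_CST);

		/* Re-check to avoid sleeping after the last reader left. */
		if ((remaining = scan(tids, max)) == 0) {
			__atomic_store_n(&gp_futex, 0, __ATOMIC_SEQ_CST);
			return;
		}

		/*
		 * Block.  Readers that see gp_futex == -1 reset it and
		 * issue FUTEX_WAKE when they exit their critical section
		 * (wake side not shown).  With the hypothetical extension,
		 * the kernel would boost every thread in tids[] here.
		 */
		futex_wait_boosting(&gp_futex, -1, tids, remaining);
	}
}

The point of the last two arguments is exactly the extension discussed above: the waiter already knows the TIDs of every reader it still has to wait for, so the kernel could boost all of them at once instead of discovering them one lock owner at a time.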
> 2) compete/wait a pi-lock which already held by thread2

(I guess you mean "complete" here)

That does not seem so efficient, because we end up waiting for a single reader at a time, and thus boosting the priority of a single reader at a time, when what we really want is to wait for all of them. Each time the reader we happen to wait for wakes us up, we would traverse the reader list all over again, possibly having to wait for other readers.

I also think that (2) has an unwanted side-effect in terms of real-time behavior: in the worst case, the grace period, rather than boosting all reader threads in parallel, ends up boosting each of the reader threads one after the other, for an O(nb reader threads * duration of individual reader critical sections) delay. Ideally, we would really like an O(duration of the longest reader critical section) delay only, which (1) could provide by boosting the priority of all the readers we happen to be waiting for. So I agree that (1) and (2) won't end up being much different in practice on a uniprocessor machine, because there is only one CPU to execute the reader threads anyway, but this leads to an important difference on SMP, because (2) actually serializes reader priority boosting.

> 3) (hide, hard to explain, require kernel changed)

This could be interesting to hear :)

> 1) is not acceptable, the OS has no such API/syscall, but 1) can be
> implemented over 2)

If we have to extend the Linux futex ABI to do it, why not? Similarly, I proposed sys_membarrier to the Linux kernel community a few months ago and got really good, positive feedback; I'm just waiting for the Userspace RCU user community to grow before I re-submit sys_membarrier for inclusion into the mainline Linux kernel.

> 2) is simpler.

As I explained above, (2) changes the grace period's worst-case execution time on SMP machines, so I'm reluctant to go that route. It may look simpler to implement at first because no kernel-level change is needed, but it would add a lot of unwelcome complexity to the user-space RCU code, which we are trying very hard to keep as simple as possible.

The way I see it, it's not a question of keeping complexity outside or inside of the kernel. Sure, we should limit the amount of complexity in the kernel as much as possible and do all we can in user-space, but if we consider the kernel and user-space together as one system, adding tons of complexity and hacks in user-space just to avoid touching the kernel leads to a more complex system overall.

I'm all for creating the right abstraction at the right OS layer, and it looks like extending the PI futex to support priority inheritance across multiple target threads would be a good fit here. I guess the Linux kernel might already need that internally to deal with PI for RCU and rwlocks: all the readers blocking the writer *should* be able to get its priority in parallel when the writer waits, right?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
ltt-dev mailing list
[email protected]
http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
