* Lai Jiangshan ([email protected]) wrote:
> ping

We discussed a "batch" design offline. Do you think you would have time
to spin a patch implementing it?

Thanks,

Mathieu

> > On 08/21/2012 11:41 AM, Lai Jiangshan wrote:
> > On 08/20/2012 09:16 PM, Mathieu Desnoyers wrote:
> >> * Lai Jiangshan ([email protected]) wrote:
> >>> On 08/16/2012 05:31 AM, Mathieu Desnoyers wrote:
> >>>> This work is derived from the patch from Lai Jiangshan submitted as
> >>>> "urcu: new wfqueue implementation"
> >>>> (http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
> >>
> >> Hi Lai,
> >>
> >>> Hi, Mathieu
> >>>
> >>> Please add this part to your patch.
> >>>
> >>> Changes (item 5 has an alternative):
> >>>
> >>> 1) Reorder the head and the tail in struct cds_wfq_queue.
> >>>    Reason: wfq is enqueue-preference.
> >>
> >> Is this a technical improvement over the original order? Both cache
> >> lines are aligned, so it should not matter which one comes first, and
> >> I'm not sure why we should favor one order over the other.
> >>
> >>> 2) Reorder the code of ___cds_wfq_next_blocking().
> >>>    Reason: shorter code, better readability.
> >>
> >> Yes, I agree with this change. Can you extract it into a separate
> >> patch?
> >>
> >>> 3) Add cds_wfq_dequeue_[un]lock().
> >>>    Reason: the fields of struct cds_wfq_queue are not exposed to
> >>>    users of the library, but some APIs need to have this lock held.
> >>>    Add these locking functions to allow users to call such APIs.
> >>
> >> I agree with this change too. You can put it into the same patch
> >> as (2).
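For illustration, a minimal sketch of what item (3) enables, assuming
the proposed cds_wfq_dequeue_lock()/cds_wfq_dequeue_unlock() wrappers
simply take and release the queue's internal dequeue mutex (the usage
pattern below is hypothetical, not from either patch):

    #include <urcu/wfqueue.h>

    /*
     * Sketch only: with the proposed lock wrappers, several
     * __cds_wfq_* operations (which require the dequeue lock to be
     * held) can be batched under a single lock acquisition, even
     * though the lock field itself is not exposed to users.
     */
    static void drain_two(struct cds_wfq_queue *q)
    {
            struct cds_wfq_node *a, *b;

            cds_wfq_dequeue_lock(q);
            a = __cds_wfq_dequeue_blocking(q);
            b = __cds_wfq_dequeue_blocking(q);
            cds_wfq_dequeue_unlock(q);
            /* a and/or b may be NULL if the queue ran empty. */
    }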
> >>> 4) Add cds_wfq_node_sync_next() and cds_wfq_dequeue_all_blocking().
> >>>    Reason: helpers for for_each_*.
> >>
> >> Points 4-5-6-7 will need more discussion.
> >>
> >> I don't like exposing "cds_wfq_node_sync_next()", which should remain
> >> an internal implementation detail not exposed to queue users. The
> >> "get_first"/"get_next" API, as well as the dequeue, provide an
> >> abstraction that hides the node synchronization, which makes it
> >> harder for the user to make mistakes. Given that the fast path of
> >> cds_wfq_node_sync_next() is just a pointer load and test, I don't
> >> think that skipping it the second time we iterate on the same queue
> >> would bring any significant performance improvement, but exposing
> >> sync_next adds API complexity, which I would like to avoid.
> >
> > If the name "cds_wfq_node_sync_next()" is bad and exposes the
> > implementation, we can rename it to "__cds_wfq_next_blocking_nocheck()"
> > to hide the implementation.
> >
> > "get_next_nocheck": get the next node of a queue; the caller ensures
> > that the next node exists.
> > "get_next": get the next node of a queue; it also finds out whether
> > the next node exists or not.
> >
> > Or just remove "cds_wfq_node_sync_next()" and use "get_next" instead.
> >
> >> About cds_wfq_dequeue_all_blocking(): the "cds_wfq_splice()" I
> >> introduced performs the same action as
> >> "cds_wfq_dequeue_all_blocking()", but ensures that we can splice
> >> lists multiple times (rather than only once), which is a very neat
> >> side-effect. Let's imagine a use-case like Tree RCU, where each CPU
> >> produces a queue of call_rcu callbacks, and we need to merge pairs of
> >> queues together recursively until we reach the tree root: by doing
> >> splice() at each level of the tree, we can combine the queues into a
> >> single queue without ever needing to iterate on each node: no node
> >> synchronization is needed until we actually need to traverse the
> >> queue at the root node level.
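A sketch of that tree-style merge, assuming a cds_wfq_splice(dest, src)
along the lines of the proposed patch (it moves all of src's nodes onto
dest and leaves src empty and reusable):

    #include <urcu/wfqueue.h>

    /*
     * Sketch only: merge an array of per-CPU queues pairwise until
     * everything is reachable from qs[0]. No per-node iteration or
     * next-pointer synchronization happens during the merge; it is
     * deferred until the merged queue is finally traversed.
     */
    static void merge_percpu_queues(struct cds_wfq_queue *qs, int nr)
    {
            int stride, i;

            for (stride = 1; stride < nr; stride *= 2)
                    for (i = 0; i + stride < nr; i += 2 * stride)
                            cds_wfq_splice(&qs[i], &qs[i + stride]);
    }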
> >> The cost of exposing dequeue_all is that we need to expose sync_next
> >> (more complexity for the user). As you exposed in your earlier
> >> email, the gain we get by exposing dequeue_all separately from
> >> splice is that dequeue_all is 1 xchg() operation, while splice is 2
> >> xchg() operations. However, given that this operation is typically
> >> amortized over many nodes, performance optimization is a less
> >> compelling argument than simplicity of use.
> >
> > Reducing the temporary big struct is also my intent.
> >
> > ---
> > Related to "internal", see the bottom.
> >
> >>> 5) Add more for_each_* iterators, whose default semantic is
> >>>    dequeuing.
> >>>    Reason: __cds_wfq_for_each_blocking_undequeue() is enough for
> >>>    non-dequeuing loops; we don't need to add more of those. Looping
> >>>    and dequeuing are the most probable use-cases. A dequeuing
> >>>    for_each_* makes the queue act like a pipe, like a channel (Go
> >>>    language).
> >>>    Alternative: rename these dequeuing __cds_wfq_for_each*() to
> >>>    __cds_wfq_dequeue_for_each*().
> >>
> >> I notice, in the iterators you added, the presence of
> >> "cds_wfq_for_each_blocking_nobreak" and
> >> "cds_wfq_for_each_blocking_nobreak_internal", which document that
> >> this special flavor is needed for loops that do not have a "break"
> >> statement. That seems very error-prone and counter-intuitive for
> >> users, and I would like to avoid it.
> >
> > __cds_wfq_for_each_blocking_safe() in its "remove-all" use is also
> > "no-break" (*); is it error-prone and counter-intuitive too?
> >
> > So this is not a good reason to reject
> > cds_wfq_for_each_blocking_nobreak(); if users don't know how to use
> > it, they should use cds_wfq_for_each_blocking() (the version in my
> > patch, which is dequeue()-based, so we can break out of the loop
> > body; see the sketch below).
> >
> > __cds_wfq_for_each_blocking_safe() and
> > cds_wfq_for_each_blocking_nobreak() have exactly the same semantics
> > with respect to "remove-all" and "no-break", but
> > cds_wfq_for_each_blocking_nobreak() does NOT CORRUPT the queue.
> >
> > So cds_wfq_for_each_blocking_nobreak() is yet another
> > __cds_wfq_for_each_blocking_safe(), and since it doesn't corrupt the
> > queue, it is really safer than __cds_wfq_for_each_blocking_safe().
> >
> > footnote (*):
> > __cds_wfq_for_each_blocking_safe() should only be used for
> > "remove-none" or "remove-all", and for "remove-none" we should use
> > __cds_wfq_for_each_blocking_undequeue() instead.
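For illustration, a dequeue-based for_each of the kind described above
could be sketched as the following hypothetical macro (not the exact
definition from either patch). Because each node handed to the loop
body has already been removed from the queue, breaking out of the loop
leaves the queue intact, with the remaining nodes still enqueued:

    #include <urcu/wfqueue.h>

    /*
     * Hypothetical sketch: dequeue one node per iteration, stop when
     * the queue is empty. A "break" is always safe here, since the
     * queue never keeps a reference to an already-visited node.
     */
    #define cds_wfq_for_each_dequeue_blocking(q, node)          \
            for ((node) = cds_wfq_dequeue_blocking(q);          \
                 (node) != NULL;                                \
                 (node) = cds_wfq_dequeue_blocking(q))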
> >> The "internal" flavor, along with "sync_next", is then used in the
> >> call_rcu code, since they are needed to get the best performance
> >> possible. I don't like using special, internal APIs for call_rcu:
> >> the original motivation for refactoring the wfqueue code was
> >> testability, and I would really like to make sure that the APIs we
> >> use internally are also exposed to the outside world, mainly for
> >> testability, and also because if we need to access internal
> >> interfaces directly to get good performance, it means we did a poor
> >> job of exporting APIs that allow the same kind of highly efficient
> >> use of the queue. Exposing an API with the "internal" keyword in it
> >> clearly discourages users from using it.
> >
> > Related to "internal", see the bottom.
> >
> >>> 6) Remove __cds_wfq_for_each_blocking_safe().
> >>>    Reason: its semantics are not clear; it is hard to give it
> >>>    well-defined semantics. When we use it, we have to delete either
> >>>    no nodes or all nodes, and even when we delete all nodes, the
> >>>    struct cds_wfq_queue still becomes stale while looping or after
> >>>    the loop.
> >>
> >> The typical use-case I envision is:
> >>
> >> - Enqueue many nodes to queue A.
> >>
> >> - Then, a dequeuer thread splices the content of queue A into its
> >>   temporary queue T (whose head is local, on its stack).
> >>
> >> - Now, the thread is free to iterate on queue T as many times as it
> >>   likes, and it does not need to change the structure of the queue
> >>   while iterating on it (a read-only operation, without
> >>   side-effects). The last time it iterates on queue T, it needs to
> >>   use the _safe iteration and free the nodes.
> >>
> >> - After the nodes have been deleted, queue T can simply be discarded
> >>   (if on the stack, function return will do so; if dynamically
> >>   allocated, it can simply be reinitialized to an empty queue, or
> >>   freed).
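That use-case, sketched under the assumption that
cds_wfq_splice_blocking() and __cds_wfq_for_each_blocking_safe() follow
the proposed patch (splice moves every node of the source queue onto
the destination; the _safe flavor prefetches the next pointer so the
current node may be freed); process() is a hypothetical user callback:

    #include <stdlib.h>
    #include <urcu/wfqueue.h>

    extern void process(struct cds_wfq_node *node); /* hypothetical */

    /*
     * Sketch only: splice everything enqueued so far on @a into a
     * stack-local queue t, then do one final _safe traversal that
     * frees each node. t simply goes out of scope afterwards.
     */
    static void consume_all(struct cds_wfq_queue *a)
    {
            struct cds_wfq_queue t;
            struct cds_wfq_node *node, *n;

            cds_wfq_init(&t);
            cds_wfq_splice_blocking(&t, a);
            __cds_wfq_for_each_blocking_safe(&t, node, n) {
                    process(node);
                    free(node);
            }
    }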
> > My for_each_*() can do exactly the same thing in this use-case,
> > since we still have splice() and cds_wfq_for_each_blocking_nobreak()
> > (the safer **_safe()). Doesn't it?
> >
> > The memory representing a struct queue is big (two cache lines),
> > something we need to avoid putting on the stack (to shorten the
> > stack, or to avoid the cost of 2 or 3 additional cache lines).
> >
> > ----- a use-case:
> > All these local fields are hot on the stack (and, for some reason,
> > not in registers):
> >
> >   int count;
> >   struct cds_wfq_queue tmp;
> >   struct cds_wfq_node *node, *next;
> >
> > This will use 4 cache lines. Should we always recommend that users
> > use splice()+tmp?
> >
> >> As you certainly noticed, we can iterate both on a queue being
> >> concurrently enqueued to and on a queue that has been spliced into.
> >> Being able to use iterators and dequeue on any queue (initial
> >> enqueue, or queue spliced to) is a feature: we can therefore
> >> postpone the "sync_next" synchronization to the moment where loop
> >> iteration is really needed (lazy synchronization), which keeps
> >> optimal locality of reference between synchronization and node data
> >> access.
> >
> >>> 7) Rename the old __cds_wfq_for_each_blocking() to
> >>>    __cds_wfq_for_each_blocking_undequeue().
> >>>    Reason: the default semantic of the other for_each_* is
> >>>    dequeuing.
> >>
> >> I'm not convinced that "__cds_wfq_for_each_blocking_nobreak",
> >> "__cds_wfq_for_each_blocking_nobreak_internal", and the sync_next
> >> APIs are improvements over the API I propose. For call_rcu, this
> >> turns:
> >>
> >>   cds_wfq_enqueue()
> >>   cds_wfq_splice_blocking()
> >>   __cds_wfq_for_each_blocking_safe() for iteration
> >>
> >> into:
> >>
> >>   cds_wfq_enqueue()
> >>   __cds_wfq_dequeue_all_blocking()
> >>   __cds_wfq_for_each_blocking_nobreak_internal() for iteration
> >>
> >> Maybe we could improve the documentation of the
> >> "__cds_wfq_for_each_blocking(_safe)", __cds_wfq_first_blocking(),
> >> and __cds_wfq_next_blocking() API members I proposed. Maybe I did
> >> not document them well enough?
> >
> > Documentation is indeed not enough for
> > __cds_wfq_for_each_blocking_safe().
> >
> > I *STRONGLY OPPOSE* __cds_wfq_for_each_blocking_safe() (although it
> > was suggested by me).
> >
> > If you oppose dequeue_all(), I would agree to this list:
> >
> >   - Remove the old __cds_wfq_for_each_blocking_safe().
> >   - Use manual splice()+get_first()+get_next() for call_rcu_thread().
> >   - Remove my proposed cds_wfq_for_each_blocking_nobreak().
> >   - Remove my proposed sync_next() and dequeue_all().
> >   - Keep the dequeue-based for_each() (alternative: introduce it in
> >     the future).
> >
> > If you also agree with this list, could you please partially merge my
> > patch into your patch and send it for review, to shorten the review
> > cycle? (Don't forget the other minimal fixes in my patch.)
> >
> > ==========
> > About internals:
> >
> > I want to expose the for_each_*; I don't want to expose these
> > helpers:
> >
> >   cds_wfq_node_sync_next()
> >   cds_wfq_dequeue_all_blocking()
> >   __cds_wfq_dequeue_all_blocking()
> >   __cds_wfq_for_each_blocking_nobreak_internal()
> >
> > (and I want to disallow users from calling these helpers directly).
> >
> > But the for_each_* use these helpers, so I have to declare them in
> > urcu/wfqueue.h, and they become a part of the ABI.
> >
> > So these helpers become internal things; how should we handle
> > internal things in an exposed *.h?
> >
> > I found that "__cds_list_del()" and "__tls_access_ ## name ()" are
> > also internals, and they are exposed. Are they exposed correctly?
> >
> > In my view, exposing internals is OK; what we can do is disallow
> > users from calling them directly while ensuring their compatibility.
> >
> > And we don't need to test exposed internals directly via test cases,
> > because they are internals, just as we don't test so many unexposed
> > internals. When we test the exposed API, the internals are also
> > tested.
> >
> > You mentioned that call_rcu_thread() uses the internals directly and
> > therefore can't be tested. That is not totally true:
> > call_rcu_thread() needs to inject a "synchronize_rcu()" into
> > __cds_wfq_for_each_blocking_nobreak(), so I have to use
> > __cds_wfq_for_each_blocking_nobreak_internal(). Tests for
> > __cds_wfq_for_each_blocking_nobreak() are enough to cover
> > call_rcu_thread().
> >
> > In one word, I don't fully agree with your distaste for exposed
> > internals, but I agree to remove the ones introduced in my patch.
> >
> > Thanks,
> > Lai

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
[email protected]
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
