ping

On 08/21/2012 11:41 AM, Lai Jiangshan wrote:
> On 08/20/2012 09:16 PM, Mathieu Desnoyers wrote:
>> * Lai Jiangshan ([email protected]) wrote:
>>> On 08/16/2012 05:31 AM, Mathieu Desnoyers wrote:
>>>> This work is derived from the patch from Lai Jiangshan submitted as
>>>> "urcu: new wfqueue implementation"
>>>> (http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
>>>>
>>>
>>
>> Hi Lai,
>>
>>> Hi, Mathieu
>>>
>>> Please add this part to your patch.
>>>
>>> Changes (item 5 has an alternative):
>>>
>>> 1) Reorder the head and the tail in struct cds_wfq_queue
>>>    Reason: wfq is enqueue-preference
>>
>> Is this a technical improvement over the original order? Both cache
>> lines are aligned, so it should not matter which one comes first, and
>> I'm not sure why we should favor one order over the other.
>>
>>> 2) Reorder the code of ___cds_wfq_next_blocking()
>>>    Reason: shorter code, better readability
>>
>> Yes, I agree with this change. Can you extract it into a separate patch?
>>
>>> 3) Add cds_wfq_dequeue_[un]lock
>>>    Reason: the fields of struct cds_wfq_queue are not exposed to users
>>>    of the lib, but some APIs need to have this lock held. Add these
>>>    locking functions to allow users to use such APIs.
>>
>> I agree with this change too. You can put it into the same patch as (2).
>>
>>> 4) Add cds_wfq_node_sync_next(), cds_wfq_dequeue_all_blocking()
>>>    Reason: helpers for for_each_*
>>
>> Points 4-5-6-7 will need more discussion.
>>
>> I don't like exposing "cds_wfq_node_sync_next()", which should remain an
>> internal implementation detail not exposed to queue users. The
>> "get_first"/"get_next" API, as well as the dequeue, provide an
>> abstraction that hides the node synchronization, which makes it harder
>> for the user to make mistakes.
>> Given that the fast path of
>> cds_wfq_node_sync_next is just a pointer load and test, I don't think
>> that skipping it the second time we iterate on the same queue would
>> bring any significant performance improvement, but exposing sync_next
>> adds API complexity, which I would like to avoid.
>
> If the name "cds_wfq_node_sync_next()" is bad and exposes the
> implementation, we can rename it to "__cds_wfq_next_blocking_nocheck()"
> to hide the implementation.
>
> "get_next_nocheck": get the next node of a queue; the caller ensures
> that the next node exists.
> "get_next": get the next node of a queue; it also checks whether the
> next node exists or not.
>
> Or just remove "cds_wfq_node_sync_next()" and use "get_next" instead.
>
>>
>> About cds_wfq_dequeue_all_blocking(): the "cds_wfq_splice()" I
>> introduced performs the same action as "cds_wfq_dequeue_all_blocking()",
>> but ensures that we can splice lists multiple times (rather than only
>> once), which is a very neat side-effect. Let's imagine a use-case like
>> Tree RCU, where each CPU produces a queue of call_rcu callbacks, and we
>> need to merge pairs of queues together recursively until we reach the
>> tree root: by doing splice() at each level of the tree, we can combine
>> the queues into a single queue without ever needing to iterate on each
>> node: no node synchronization is needed until we actually need to
>> traverse the queue at the root node level.
>>
>> The cost of exposing dequeue_all is that we need to expose sync_next
>> (more complexity for the user). As you explained in your earlier email,
>> the gain we get by exposing dequeue_all separately from splice is that
>> dequeue_all is 1 xchg() operation, while splice is 2 xchg() operations.
>> However, given that this operation is typically amortized over many
>> nodes, performance optimization is a less compelling argument than
>> simplicity of use.
>
> Avoiding the temporary big struct is also my aim.
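To make the xchg() accounting above concrete, here is a minimal, single-threaded sketch of the scheme under discussion. The names echo the cds_wfq_* API, but this is an illustration in C11 atomics, not the liburcu code: enqueue and dequeue_all each cost one xchg(), splice costs two, and the sync_next fast path is a single pointer load and test.

```c
#include <stdatomic.h>
#include <stddef.h>

struct wfq_node {
    _Atomic(struct wfq_node *) next;
};

struct wfq_queue {
    _Atomic(struct wfq_node *) tail;     /* enqueue side */
    struct wfq_node dummy;               /* dequeue side starts here */
};

static void wfq_init(struct wfq_queue *q)
{
    atomic_store(&q->dummy.next, NULL);
    atomic_store(&q->tail, &q->dummy);
}

/* Wait-free enqueue: ONE xchg() on tail, then publish via ->next. */
static void wfq_enqueue(struct wfq_queue *q, struct wfq_node *n)
{
    atomic_store(&n->next, NULL);
    struct wfq_node *prev = atomic_exchange(&q->tail, n);
    atomic_store(&prev->next, n);  /* a dequeuer may briefly see NULL */
}

/* sync_next fast path: a single load and test; it only spins while an
 * enqueuer sits between its two stores above. */
static struct wfq_node *wfq_sync_next(struct wfq_node *n)
{
    struct wfq_node *next;
    while ((next = atomic_load(&n->next)) == NULL)
        ;  /* real code would cpu_relax()/poll here */
    return next;
}

/* dequeue_all: ONE xchg() -- detach the whole chain, leaving q empty.
 * Returns the first node; the chain ends where ->next is NULL. */
static struct wfq_node *wfq_dequeue_all(struct wfq_queue *q)
{
    struct wfq_node *last = atomic_exchange(&q->tail, &q->dummy);
    if (last == &q->dummy)
        return NULL;                     /* queue was empty */
    struct wfq_node *first = wfq_sync_next(&q->dummy);
    atomic_store(&q->dummy.next, NULL);  /* reset for future enqueues */
    return first;
}

/* splice: TWO xchg() -- detach src's chain, then append it to dest. */
static void wfq_splice(struct wfq_queue *dest, struct wfq_queue *src)
{
    struct wfq_node *last = atomic_exchange(&src->tail, &src->dummy);
    if (last == &src->dummy)
        return;                          /* src was empty */
    struct wfq_node *first = wfq_sync_next(&src->dummy);
    atomic_store(&src->dummy.next, NULL);
    struct wfq_node *prev = atomic_exchange(&dest->tail, last);
    atomic_store(&prev->next, first);
}
```

In the real implementation enqueuers run concurrently; the NULL window between the two stores in wfq_enqueue is exactly why dequeuers need a blocking sync_next.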
>
> ---
> Related to "internal", see the bottom.
>
>>
>>> 5) Add more for_each_*, whose default semantic is dequeuing
>>>    Reason: __cds_wfq_for_each_blocking_undequeue is enough for
>>>    non-dequeuing loops; we don't need to add more.
>>>    Looping and dequeuing are the most probable use-cases.
>>>    A dequeuing for_each_* makes the queue act like a pipe, or a
>>>    channel (as in the Go language).
>>>    Alternative: rename these dequeuing __cds_wfq_for_each*() to
>>>    __cds_wfq_dequeue_for_each*
>>
>> I notice, in the iterators you added, the presence of
>> "cds_wfq_for_each_blocking_nobreak" and
>> "cds_wfq_for_each_blocking_nobreak_internal", which document that this
>> special flavor is needed for loops that do not have a "break" statement.
>> That seems very much error-prone and counter-intuitive for users, and I
>> would like to avoid that.
>
> __cds_wfq_for_each_blocking_safe(remove-all) is also "no-break" (*); is
> it error-prone too?
>
> So this is not a good reason to reject
> cds_wfq_for_each_blocking_nobreak(): if users don't know how to use it,
> they should use cds_wfq_for_each_blocking() (the version in my patch,
> which is dequeue()-based, so we can break out of the loop body).
>
> __cds_wfq_for_each_blocking_safe() and cds_wfq_for_each_blocking_nobreak()
> have exactly the same semantics in the "remove-all" and "no-break" cases,
> but cds_wfq_for_each_blocking_nobreak() does NOT corrupt the queue.
>
> So cds_wfq_for_each_blocking_nobreak() is yet another
> __cds_wfq_for_each_blocking_safe(), and since it doesn't corrupt the
> queue, it is really safer than __cds_wfq_for_each_blocking_safe().
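The contrast being argued above — a dequeue-based for_each from which break is safe, versus a _safe iterator that corrupts the queue unless the loop removes none or all of the nodes — can be sketched as follows. This is a simplified single-threaded illustration (the wfq_* names are made up for the sketch, and the real ___cds_wfq_dequeue_blocking also handles concurrent enqueuers), not Lai's patch:

```c
#include <stdatomic.h>
#include <stddef.h>

struct wfq_node {
    _Atomic(struct wfq_node *) next;
};

struct wfq_queue {
    _Atomic(struct wfq_node *) tail;
    struct wfq_node dummy;   /* dummy head node */
};

static void wfq_init(struct wfq_queue *q)
{
    atomic_store(&q->dummy.next, NULL);
    atomic_store(&q->tail, &q->dummy);
}

static void wfq_enqueue(struct wfq_queue *q, struct wfq_node *n)
{
    atomic_store(&n->next, NULL);
    struct wfq_node *prev = atomic_exchange(&q->tail, n);
    atomic_store(&prev->next, n);
}

/* Single-threaded pop from the front; the real blocking dequeue also
 * waits for an enqueuer that has not yet published ->next. */
static struct wfq_node *wfq_dequeue(struct wfq_queue *q)
{
    struct wfq_node *node = atomic_load(&q->dummy.next);
    if (!node)
        return NULL;
    atomic_store(&q->dummy.next, atomic_load(&node->next));
    if (atomic_load(&q->tail) == node)       /* removed the last node */
        atomic_store(&q->tail, &q->dummy);
    return node;
}

/* Dequeue-based iteration: each node is fully removed from the queue
 * before the loop body sees it, so "break" leaves the queue intact and
 * consistent -- the remaining nodes stay reachable. */
#define wfq_dequeue_for_each(q, node) \
    for ((node) = wfq_dequeue(q); (node) != NULL; (node) = wfq_dequeue(q))
```

A _safe-style iterator, by contrast, walks ->next pointers while the queue head still references the visited nodes; breaking out after freeing some of them leaves the queue pointing at freed memory.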
>
> """footnote: (*)
> __cds_wfq_for_each_blocking_safe() should be used for "remove-none" or
> "remove-all".
> We should use __cds_wfq_for_each_blocking_undequeue() instead of
> __cds_wfq_for_each_blocking_safe(remove-none).
> """
>
>>
>> The "internal" flavor, along with "sync_next", are then used in the
>> call_rcu code, since they are needed to get the best performance
>> possible. I don't like to use special, internal APIs for call_rcu: the
>> original motivation for refactoring the wfqueue code was testability,
>> and I would really like to make sure that the APIs we use internally are
>> also exposed to the outside world, mainly for testability, and also
>> because if we need to directly access internal interfaces to have good
>> performance, it means we did a poor job of exporting APIs that allow
>> the same kind of highly-efficient use of the queue. Exposing an API
>> with the "internal" keyword in it clearly discourages users from using
>> it.
>
> Related to "internal", see the bottom.
>
>>
>>
>>> 6) Remove __cds_wfq_for_each_blocking_safe
>>>    Reason: its semantics are not clear; it is hard to give it
>>>    well-defined semantics. When we use it, we have to delete either
>>>    none or all of the nodes, and even when we delete all nodes,
>>>    struct cds_wfq_queue still becomes stale while looping or after
>>>    the loop.
>>
>> The typical use-case I envision is:
>>
>> - enqueue many nodes to queue A
>>
>> - then, a dequeuer thread splices the content of queue A into its
>>   temporary queue T (whose head is local, on its stack).
>>
>> - now, the thread is free to iterate on queue T as many times as it
>>   likes, and it does not need to change the structure of the queue
>>   while iterating on it (read-only operation, without side-effect).
>>   The last time it iterates on queue T, it needs to use the _safe
>>   iteration and free the nodes.
>>
>> - after the nodes have been deleted, queue T can simply be discarded
>>   (if on the stack, function return will do so; if dynamically
>>   allocated, it can simply be reinitialized to an empty queue, or
>>   freed).
>
> My for_each_*() can do exactly the same thing in this use-case, since we
> still have splice() and cds_wfq_for_each_blocking_nobreak() (a safer
> **_safe()). Doesn't it?
>
> The memory representing a struct queue is big (two cache lines),
> something we need to avoid putting on the stack (to shorten the stack
> size, or to avoid the cost of 2 or 3 additional cache lines).
>
> ----------- a use case
> All these local fields are hot in the stack (and, for some reason, not
> in registers):
>     int count;
>     struct cds_wfq_queue tmp;
>     struct cds_wfq_node *node, *next;
>
> It will use 4 cache lines. Should we always recommend that users use
> splice()+tmp?
>
>>
>> As you certainly noticed, we can iterate both on a queue being
>> concurrently enqueued to, and on a queue that has been spliced into.
>> Being able to use iterators and dequeue on any queue (initial enqueue
>> target, or queue spliced into) is a feature: we can therefore postpone
>> the "sync_next" synchronization to the moment where loop iteration is
>> really needed (lazy synchronization), which keeps optimal locality of
>> reference of synchronization vs node data access.
>>
>>> 7) Rename old __cds_wfq_for_each_blocking() to
>>>    __cds_wfq_for_each_blocking_undequeue()
>>>    Reason: the default semantic of the other for_each_* is dequeuing.
>>
>> I'm not convinced that "__cds_wfq_for_each_blocking_nobreak",
>> "__cds_wfq_for_each_blocking_nobreak_internal", and the sync_next APIs
>> are improvements over the API I propose.
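For reference, the two-cache-line footprint Lai is costing out above comes from keeping the head and tail on separate cache lines, so that dequeuers and enqueuers do not false-share. A sketch of such a layout, assuming 64-byte lines (the sketch_* names and the exact member types are illustrative, not the liburcu definitions):

```c
#include <stdalign.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64   /* assumed cache-line size */

struct sketch_wfq_node {
    struct sketch_wfq_node *next;
};

/* head and tail each forced onto their own cache line, so a local
 * "struct cds_wfq_queue tmp" on the stack occupies two full lines. */
struct sketch_wfq_queue {
    alignas(SKETCH_CACHE_LINE) struct sketch_wfq_node *head;
    alignas(SKETCH_CACHE_LINE) struct sketch_wfq_node *tail;
};
```

Adding the loop locals Lai lists (count, node, next) on top of this struct is where his 4-cache-line stack estimate comes from.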
>> For call_rcu, this turns:
>>
>>   cds_wfq_enqueue()
>>   cds_wfq_splice_blocking()
>>   __cds_wfq_for_each_blocking_safe() for iteration
>>
>> into:
>>
>>   cds_wfq_enqueue()
>>   __cds_wfq_dequeue_all_blocking()
>>   __cds_wfq_for_each_blocking_nobreak_internal() for iteration
>>
>> Maybe we could improve the documentation of the
>> "__cds_wfq_for_each_blocking(_safe)", __cds_wfq_first_blocking(), and
>> __cds_wfq_next_blocking() API members I proposed. Maybe I did not
>> document them well enough?
>
> Documentation is indeed not enough for
> __cds_wfq_for_each_blocking_safe().
>
> I *STRONGLY OPPOSE* __cds_wfq_for_each_blocking_safe() (although it was
> suggested by me).
>
> If you oppose dequeue_all(), I agree with this list:
>   Remove the old __cds_wfq_for_each_blocking_safe().
>   Use manual splice()+get_first()+get_next() for call_rcu_thread().
>   Remove my proposed cds_wfq_for_each_blocking_nobreak().
>   Remove my proposed sync_next() and dequeue_all().
>   Keep the dequeue-based for_each() (alternative: introduce it in the
>   future).
>
> If you also agree with this list, could you please partially merge my
> patch into your patch and send it for review, to reduce the review
> cycle? (Don't forget the other minimal fixes in my patch.)
>
> ==========
> About internals:
>
> I want to expose the for_each_*; I don't want to expose these helpers:
>   cds_wfq_node_sync_next()
>   cds_wfq_dequeue_all_blocking()
>   __cds_wfq_dequeue_all_blocking()
>   __cds_wfq_for_each_blocking_nobreak_internal()
> (and I want to disallow users from using these helpers directly).
>
> But the for_each_* use these helpers, so I have to declare them in
> urcu/wfqueue.h, and they become a part of the ABI.
>
> So these helpers become internal things; how do we handle internal
> things in an exposed *.h?
>
> I found that "__cds_list_del()" and "__tls_access_ ## name ()" are also
> internals, and they are exposed. Are they exposed correctly?
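The header-visibility problem described above is mechanical: a public for_each macro expands inside the user's translation unit, so every helper it calls must be declared in the installed header, internal or not. A hypothetical sketch (the sketch_* names are made up; this is not the liburcu header):

```c
#include <stddef.h>

struct sketch_node {
    struct sketch_node *next;
};

/* "Exposed internal": the double underscore signals "do not call
 * directly", but the declaration must still live in the public header,
 * because the macro below expands to a call to it in user code. */
static inline struct sketch_node *__sketch_next(struct sketch_node *n)
{
    return n->next;
}

#define sketch_for_each(head, pos) \
    for ((pos) = (head); (pos) != NULL; (pos) = __sketch_next(pos))
```

This is also why testing the public macro exercises the internal helper: any test of sketch_for_each necessarily runs __sketch_next, which is Lai's argument for not testing exposed internals separately.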
>
> In my view, exposing internals is OK; what we can do is disallow users
> from using them directly, while ensuring compatibility.
>
> And we don't need to test exposed internals directly via testcases,
> because they are internals, just as we don't need to test so many
> un-exposed internals. When we test the exposed API, the internals are
> also tested.
>
> You mentioned that call_rcu_thread() uses the internals directly and
> can't be tested. That is not totally true: call_rcu_thread() needs to
> inject a "synchronize_rcu()" into __cds_wfq_for_each_blocking_nobreak(),
> so I have to use __cds_wfq_for_each_blocking_nobreak_internal().
> Tests for __cds_wfq_for_each_blocking_nobreak() are enough for
> call_rcu_thread().
>
> In one word, I don't fully share your distaste for exposed internals,
> but I agree to remove the ones introduced in my patch.
>
> Thanks,
> Lai
>
> _______________________________________________
> lttng-dev mailing list
> [email protected]
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
