* Paul E. McKenney ([email protected]) wrote:
> On Wed, Apr 18, 2012 at 04:10:17PM -0400, Mathieu Desnoyers wrote:
> > Hi,
> >
> > FYI, I pushed extra documentation of the RCU lock-free hash table found
> > in the userspace RCU library master branch regarding the linearizability
> > guarantees it provides. Feedback is welcome.
Hi Paul,

Your reply brings interesting thoughts, see below.

> Interesting!  Please see below for my take on it.
>
> 							Thanx, Paul
>
> > Thanks,
> >
> > Mathieu
> >
> > commit 0f5543cb1780acef35878646e6cdc966f1406c18
> > Author: Mathieu Desnoyers <[email protected]>
> > Date:   Wed Apr 18 16:05:29 2012 -0400
> >
> >     rculfhash: document linearizability guarantees
> >
> >     Signed-off-by: Mathieu Desnoyers <[email protected]>
> >
> > diff --git a/rculfhash.c b/rculfhash.c
> > index 6f470fd..d22b44d 100644
> > --- a/rculfhash.c
> > +++ b/rculfhash.c
> > @@ -98,6 +98,33 @@
> >   * hash table nodes. These tables are invariant after they are
> >   * populated into the hash table.
> >   *
> > + * Linearizability Guarantees:
>
> I suggest calling these ordering guarantees.  Otherwise, people will
> glance at the heading and assume that the hash table is linearizable,
> which from what I can see is not the case.  (And I do -not- recommend
> making it be the case, just to be sure that there is no confusion.)

"Ordering guarantees" works for me, but I'm curious to know the reasons
that make you think it's not linearizable.

AFAIU, linearizability requires:

- That there is an atomic step at which each operation appears to take
  place, located between the start and the end of each method
  (the linearization point).
- Progress guarantees for each method.

So, looking at linearization points for updates:

- add: takes place atomically at the cmpxchg that links the new node
  into the linked list.
- add_replace: takes place atomically at the uatomic_cmpxchg
  responsible for linking the new node into the linked list, replacing
  an existing node atomically by setting the "removed" flag in the
  replaced node's next pointer at the same time as the new node is
  inserted.
- add_unique: takes place atomically at the cmpxchg that either links
  the new node into the linked list, or detects that the prev node's
  next pointer has changed, triggering a retry.
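To make the notion of a linearization point concrete, here is a
minimal, self-contained C11 sketch of a CAS-based list insertion. This
is not the rculfhash code; `struct node` and `list_insert_after` are
invented for illustration. The single compare-and-exchange plays the
role of the cmpxchg in add/add_unique described above.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	long key;
	_Atomic(struct node *) next;
};

/*
 * Link 'node' after 'prev', assuming 'prev->next' was observed to be
 * 'expected'.  The compare-and-exchange is the linearization point:
 * before it succeeds, no traversal can reach 'node'; once it succeeds,
 * every traversal from the head sees it.  Returns 0 on success, or
 * nonzero if a concurrent update changed 'prev->next' first, in which
 * case the caller re-reads the list and retries.
 */
static int list_insert_after(struct node *prev, struct node *expected,
			     struct node *node)
{
	atomic_store_explicit(&node->next, expected, memory_order_relaxed);
	return !atomic_compare_exchange_strong_explicit(&prev->next,
			&expected, node,
			memory_order_release, memory_order_relaxed);
}
```

A failed CAS here corresponds exactly to the retry case of add_unique:
the caller learns that prev's next pointer changed underneath it.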
- del: takes place atomically by setting the "removed" flag in the next
  pointer of the node to remove (atomic "or").

Linearization points for reads (these apply to the individual basic
operations, e.g. lookup, get_first, get_next, get_next_duplicate):

- lookup: atomically reading the next pointer of the node prior to
  either:
  - the node containing the key we are looking for,
  - the node containing the first key above the key we are looking for,
  - the end of the list (next pointer is NULL).
- get_first: atomically reading the head next pointer.
- get_next: atomically reading the next pointer of the current node.
- get_next_duplicate: atomically reading the next pointer of the
  current node.

Progress guarantees:

- Updates: each retry faced by an update is always caused by another
  concurrent operation that itself progresses (a lock-freedom type of
  progress guarantee).
- Reads: read operations never loop: the linked lists only go in one
  direction, from head to tail.

Note that for linearizability, I used the basic read operations (which
happen atomically) rather than sequences of such operations.

One operation I'm not convinced can be called entirely linearizable is
resize: it happens lazily, at some point in time that is usually not
within the update operations. For that one, I'd be tempted to say that
the hash table is not linearizable.

Thoughts ?

> > + *
> > + * To discuss these guarantees, we first define "read" operations as any
> > + * of the following operations surrounded by an RCU read-side lock/unlock
> > + * pair:
> > + * - cds_lfht_lookup
> > + * - cds_lfht_lookup followed by iteration with cds_lfht_next_duplicate
> > + * - cds_lfht_first followed by iteration with cds_lfht_next
> > + *
> > + * We define "write" operations as any of cds_lfht_add,
> > + * cds_lfht_add_unique, cds_lfht_add_replace, cds_lfht_del.
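As a concrete illustration of the del linearization point described
above, here is a self-contained C11 sketch of the removed-flag
technique. This is not the rculfhash code (rculfhash applies uatomic_or
to the real node's next pointer); the names below are invented, and the
next field is modeled as a uintptr_t so the low bit can carry the flag.

```c
#include <stdatomic.h>
#include <stdint.h>

#define REMOVED_FLAG ((uintptr_t)1)

struct lfnode {
	long key;
	_Atomic uintptr_t next;	/* successor pointer bits | REMOVED_FLAG */
};

/*
 * Logically remove 'node' by atomically or-ing the flag into its next
 * pointer.  This single atomic "or" is the linearization point of del:
 * the instant the flag is set, traversals treat the node as deleted,
 * even though it is still physically linked (physical unlinking can
 * happen later).  Returns 0 on success, nonzero if a concurrent del
 * already removed the node.
 */
static int lfnode_del(struct lfnode *node)
{
	uintptr_t old = atomic_fetch_or_explicit(&node->next, REMOVED_FLAG,
						 memory_order_acq_rel);
	return (old & REMOVED_FLAG) != 0;
}

/* Readers test the flag with a single atomic load of the next field. */
static int lfnode_is_removed(struct lfnode *node)
{
	return (atomic_load_explicit(&node->next, memory_order_acquire)
		& REMOVED_FLAG) != 0;
}
```

Because the return value of the fetch-and-or tells exactly one caller
that it set the flag first, concurrent del operations on the same node
resolve deterministically, which is what lets del have a single atomic
step.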
> > + *
> > + * The following guarantees are offered by this hash table:
> > + *
> > + * A) "read" after "write" will always return the result of the latest
> > + *    write.
>
> Hmmm...  "Latest" by what measure?  What is the guarantee really
> saying if there are a large number of writes and reads all involving
> the same hash key?
>
> My guess is that the guarantee is actually of the following form:
> If there is ordering between a write and a later read, then the
> read is guaranteed to see the write or some later write.

Yes, although I think reasoning in terms of a "sequence of reads" might
be more appropriate than a "basic read operation". By that, I mean
thinking in terms of iteration over the entire hash table, or iteration
over a key and all its duplicates, rather than the basic
lookup/get_next/get_next_duplicate/get_first operations.

> As I understand it, this guarantee requires that the read and write
> operate on the same key.

Well, a sequence of "read" operations can be an iteration over the
whole table (get_first, then get_next until we reach the end of the
table). So what I'm trying to say here is that if we have a sequence of
read operations whose first operation is ordered after a write, that
sequence is guaranteed to see this write or some later write. Of
course, if the sequence of read operations executes concurrently with
the write, it may or may not see the write. And if the sequence of read
operations is known to have its last read ordered before a write, it is
guaranteed not to see this write.

> This might be easier given a guarantee that writes involving a given
> key are seen in order.  I believe that the hash table does provide this
> guarantee and that it would be good to state it explicitly.

I'm wondering if the reasoning about a single "key" you are proposing
takes into account that the hash table supports duplicate keys ?

> > + * B) "write" after "read" will never be returned by the read.
>
> There is a useful guarantee involving a pair of reads:  If a pair
> of reads of a given key are ordered (e.g., by a memory barrier),
> then the second read will return the same value as the first
> read or some later value.

Yes. For example, if we use add_replace to replace a given node with a
version that has an incremented counter, reads that are ordered with
memory barriers are guaranteed to see this counter as always
incrementing.

> > + * C) It is guaranteed that after a grace period following a "del" and
> > + *    "replace" operation, no reference to the removed items exists in
> > + *    the hash table.
>
> I would instead say something like:  If a grace period separates a "del"
> or "replace" operation and a subsequent read operation, then that reader
> is guaranteed not to see the removed item.

Your statement is true, but the idea I want to convey is actually
stronger: after that grace period, we guarantee that no read nor update
operation can see the removed pointer (not even internally).

> > + * D) Uniqueness guarantee: when using add_unique and/or add_replace to
> > + *    insert nodes into the table, if there was previously one node or
> > + *    less with the same key being inserted by one or more concurrent
> > + *    add_unique and/or add_replace, all concurrent "read" operations
> > + *    performed on the hash table are guaranteed to find one, and only
> > + *    one, node with that key.
>
> How about:  Given a hash table that does not contain duplicate items
> for a given key, there will only be one item in the hash table after
> an arbitrary sequence of add_unique and/or add_replace operations.
> Note, however, that a pair of concurrent read operations might well
> access two different items with that key.

Yes, that works.

Thanks for the feedback!

Mathieu

> > >  * Bucket node tables:
> > >  *
> > >  *          hash table   hash table   the last      all bucket node tables
> >
> > --
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> >

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
[email protected]
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
