Author: Armin Rigo <ar...@tunes.org> Branch: Changeset: r1434:57c2c6d8f2ef Date: 2014-09-28 16:52 +0200 http://bitbucket.org/pypy/stmgc/changeset/57c2c6d8f2ef/
Log: Uh, sorry, 'design.txt' is the one I meant to commit... diff --git a/hashtable/design.txt b/hashtable/design.txt new file mode 100644 --- /dev/null +++ b/hashtable/design.txt @@ -0,0 +1,72 @@ +Goal +====== + +The goal is to have dictionaries where a read-write or write-write +conflict does not cause aborts if they occur with keys that have +different 64-bit hashes. + +(We might prefer the condition to be "on different keys even in case of +hash collision", but that's hard to achieve in general: for Python +dicts, "to be equal" involves calling the __eq__() method of objects.) + +We distinguish between reading a value associated to a key, and only +checking that the key is in the dictionary. It makes a difference if +a concurrent transaction writes to the value. + +Some operations on a dict (particularly len() and __nonzero__()) involve +the global state of the dict, and thus cause conflict with any write +that adds or removes keys. They should not cause conflicts with reads, +or with writes that only change existing values. + +In theory, we might refine this to: a len() only conflicts with a +different transaction whose net effect is to change the length (adding 3 +keys and removing 3 other keys is fine); and a __nonzero__() only +conflicts with a transaction whose net effect is to change the dict from +empty to non-empty or vice-versa. + +Iterating over the keys of a dict doesn't have to conflict with other +transactions that only change existing values. Iterating over the +values or the items conflict with other transactions doing any write at +all. + + +Model +======= + +We can use the following idea to give a theoretical model of the +above: + +Let H = {0, ... 2**64-1} be the set of possible hashes. A dictionary is +an array of length 2**64, where each item contains a "line" of zero or +more key/value pairs. We have STM read and write markers as follows: + +* for every key/value pair, we have two markers (a read and a write) on + the "value"; + +* for every line (i.e. for every possible hash value), we also have two + markers (a read and a write) on the line itself. + +Then: + +* Reading or writing the value associated with an existing key accesses + the read marker of the line, and the read or write marker of that + particular value. + +* Checking for the presence of a key only accesses the read marker of + the line. + +* Creating a new key accesses the write marker of the line (the write + marker of the newly added value is not relevant then, because other + transactions won't be able to access the line anyway). + +* Deleting a key also accesses the write marker of the line. (We cannot + do it by pretending the write the value NULL, so accessing only the + write marker of the value, because then it wouldn't conflict with + another transaction that checks for the presence of the key by + accessing only the read marker of the line.) + +* Global operations, like getting the list of keys, work by mass-marking + all the lines in H (all 2**64 of them, so obviously it needs to be + special-cased in the implementation). More precisely, len(), keys(), + etc., sets all the lines' read markers; clear() sets all the lines' + write markers. diff --git a/hashtable/design2.txt b/hashtable/design2.txt deleted file mode 100644 --- a/hashtable/design2.txt +++ /dev/null @@ -1,124 +0,0 @@ -Goal -====== - -The goal is to have dictionaries where a read-write or write-write -conflict does not cause aborts if they occur with keys that have -different 64-bit hashes. - -(We might prefer the condition to be "on different keys even in case of -hash collision", but that's hard to achieve in general: for Python -dicts, "to be equal" involves calling the __eq__() method of objects.) - -We distinguish between reading a value associated to a key, and only -checking that the key is in the dictionary. It makes a difference if -a concurrent transaction writes to the value. - -Some operations on a dict (particularly len() and __nonzero__()) involve -the global state of the dict, and thus cause conflict with any write -that adds or removes keys. They should not cause conflicts with reads, -or with writes that only change existing values. - -In theory, we might refine this to: a len() only conflicts with a -different transaction whose net effect is to change the length (adding 3 -keys and removing 3 other keys is fine); and a __nonzero__() only -conflicts with a transaction whose net effect is to change the dict from -empty to non-empty or vice-versa. The latter is probably more important -than the former, so we'll ignore the former. - -Iterating over the keys of a dict doesn't have to conflict with other -transactions that only change existing values. Iterating over the -values or the items conflict with other transactions doing any write at -all. - - -Idea -====== - -A dict is implemented used two distinct parts: the committed part, -and the uncommitted one. Each part is optimized differently. - - -Committed part --------------- - -The committed part uses separate chaining with linked lists. It is an -array of pointers of length some power of two. From the hash, we access -item (hash & (power_of_two - 1)). We get a pointer to some Entry -object, with fields "hash", "key", "value", and "next". The "hash" -field stored in the Entry objects is the full 64-bit hash. The "next" -field might point to more Entry objects. - -This whole structure is only modified during commit, by special code not -subject to the normal STM rules. There is only one writer, the -transaction currently trying to commit; but we need to be careful so that -concurrent reads work as expected. - -For the sequel, the committed part works theoretically like an array of -length 2**64, indexed by the hash, where each item contains zero of more -Entry objects with that hash value. - - -Uncommitted part ----------------- - -For the uncommitted part we can use a hash table similar to the one used -for RPython dicts, with open addressing. We import data from the -committed part to this uncommitted part when needed (at the granularity -of a 64-bit hash value). More precisely, the uncommitted part can be in -one of these states: - -* It can be a freshly created dictionary, with no committed part yet. - That's the easy case: the uncommitted hash table is all we need. - -* Or, we have a committed part, and we have imported from it - zero or more 64-bit hash values. We need to remember which ones. - That includes the imports that yielded zero key/value pairs. For each - imported hash value, we make (zero or more) entries in the uncommitted - part where we copy the key, but where the value is initially missing. - The value is copied lazily, with another lookup that will mark the - Entry object as "read" in the usual STM sense. - -* We may have additionally imported the "emptiness" or "non-emptiness" - of the committed part. - -* Last option: the current transaction is depending on the exact set - of committed keys. We no longer need to remember which ones - individually. This state is equivalent to having imported *all* - possible 64-bit hash values. - - -Commit time ------------ - -At commit time, we need to do these extra steps. The points followed by -(*) need to be done carefully because concurrent threads might be -reading the same data. - -* First, we do the usual STM validation. It will detect read-write - and write-write conflicts on existing values thanks to the read and - write markers xxx - -* We validate the keys: for every imported hash value, we check that - importing it now would give us the same answer as it did previously - (i.e. the committed table has got the same set of keys with this - particular hash as it did previously). - -* For key/value pairs that have been newly added by the current - transaction, the validation above is enough too: to add a key/value - pair, we must have imported all Entries with the same hash anyway. - So at this point we only need to create and attach(*) new Entry - objects for new key/value pairs. - - -xxxxxxxxxxx - - - -* First, we create new Entry objects for all key/value pairs that - are created by the current transaction. - -* First, we create or update the Entry objects for all key/value - pairs that have been modified by the current transaction. We - store the new ones by carefully changing the array of pointers. - -* _______________________________________________ pypy-commit mailing list pypy-commit@python.org https://mail.python.org/mailman/listinfo/pypy-commit