Peng, I actually am thinking about it.

Underlying problem: while unordered means conceptually unordered as far as the collection goes, the items in the collection, if homogenous enough, may have a natural order, which users find hard to ignore. Even if not comparable, an implementation such as CPython that uses linear sequential memory will impose some order. Even if the implementation uses unordered (holographic?) memory, order will be imposed to iterate, as when creating a serialized representation of the collection. Abstract objects, concrete objects, and serialized representations are three different things, but people tend to conflate them.

Order consistency issues: if the unordered collection is iterated, when can one expect the order to be the same? Theoretically, essentially never, except that iterating dicts by keys, values, or key-value pairs is guaranteed to be consistent, which means that re-iterating has to be consistent. I actually think the same might as well be true for sets, although there is no doc that says so.

If two collections are equal, should the iteration order be the same? It has always been true that if hash values collide, insertion order matters. However, a good hash function avoids hash collisions as much as possible in practical use cases. Without doing something artificial, as I did with the example, collisions should be especially rare on 64-bit builds. If one collection has a series of additions and deletions so that the underlying hash table has a different size than an equal collection build just from insertions, then order will also be different.

If the same collection is built by insertion in the same order, but in different runs, bugfix versions, or language versions, will iteration order by the same? Historically, it has been for CPython for about a decade, and people has come to depend on that constancy, in spite of warning not to. Even core developers have not been immune, as the CPython test suite has a few set or dict iteration order dependencies until it was recently randomized.

Late last year, it became obvious that this constancy is a practical denial-of-service security hole. The option to randomize hashing for each run was added to 2.6, 2.7, 3.1, and 3.2. Randomized hashing by run is part of 3.3. So some of the discussion above is obsolete. The example I gave only works for that one run, as hash('a') changes each run. So iteration order now changes with each run in fact as well as in theory.

For the doc, the problem is what to say and where without being repetitous (and to get multiple people to agree ;-).

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to