On Sep 20, 2019, at 03:18, Richard Musil <risa20...@gmail.com> wrote: > > This is an interesting point, which is difficult to deduce without an > implementation insight. I would for example assume that there is a "list > construct" which holds just the references to the actual objects (strings in > our example here) and the same for the dict (hashmap), so the actual objects > could be anywhere, while the "list" or "dict" can be in one (memory) place. > Did you mean that, or that also the object themselves are in the "same place"?
With a list of strings, it’s actually even more than that. A CPython list is a (pointer to a) struct that holds, along with a few other things, a pointer to an array of PyObject* pointers. This array is obviously contiguous by definition. By comparison, a set is a (pointer to a) struct that holds a pointer to a hash table, which is an array of buckets in hash order, which isn’t the order you’re going to be accessing them in for the b set. If the values are strings, the pointers point to PyUnicode structs, each of which holds… well, it’s complicated, but in this case (where they’re all pure ASCII and constructed out of normal Python operations) it’s going to turn out to be a pointer to a char*, which is the array of actual characters. Also notice that string objects cache their hash in the struct, so for the set case, it only has to go to the char array at the very end to double-check equality; the operations before that only go one level down instead of two. Comparisons have to go all the way to the chars every time. The PyUnicode structs get allocated by a special object allocator which is somewhat complicated. When you construct a bunch of objects in order in a fresh process they’re likely to end up mostly in contiguous order, maybe with a few jumps every so often. This would not be as true in a long-lived process that’s created and freed zillions of previous objects. And then the char arrays are allocated by a different allocator, but are also likely to be largely in contiguous order in this test, but not so much in a long-lived process. In theory, because str is a known immutable type, Python is allowed to merge dups into the same object. But in practice, we’re creating midsize strings out of __add__ and join, so I don’t think CPython will ever do that here. (I don’t know whether an implementation designed to sacrifice speed for space more, like MicroPython, might?) Of course if you generate random strings and sort them, neither the objects nor the char arrays will be contiguous anymore, but the list’s pointer array will be. My version of the test generates strings that happen to be in order and shuffles them so we can sort them. In that case, the sorting part won’t have contiguous input in the objects and char arrays, but the input to the step-compare part will be close to contiguous again (although not exactly so, if there are any dups that could get shuffled out of order), so it’s still giving the list code an unfair advantage that it wouldn’t have in a real life unsorted case. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FUDJU4AKJIFM2QNZP7HCMDQB2I6GWXPG/ Code of Conduct: http://python.org/psf/codeofconduct/