> On Feb 26, 2019, at 3:30 AM, INADA Naoki <songofaca...@gmail.com> wrote:
>
> I'm working on compact and ordered set implementation.
> It has internal data structure similar to new dict from Python 3.6.
>
I've also looked at this. Some thoughts:
* Set objects have a different and conflicting optimization that works better
for a broad range of use cases. In particular, there is a linear probing
search step that gives excellent cache performance (multiple entries retrieved
in a single cache line) and it reduces the cost of finding the next entry to a
single increment (entry++). This greatly reduces the cost of collisions and
makes it cheaper to verify an item is not in a set.
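For anyone who hasn't read the C code, here is a toy Python sketch of that probing strategy (illustrative only -- the real logic lives in C in Objects/setobject.c, and the constant and perturbation details below are simplifications, not the actual values):

```python
LINEAR_PROBES = 4  # stand-in value; the C code scans several adjacent entries

def find_slot(table, key):
    """Return the index holding `key`, or the first empty slot seen.
    Assumes `table` is a power-of-two-sized list that is never full."""
    mask = len(table) - 1
    h = hash(key)
    i = h & mask
    perturb = h
    while True:
        # Cheap linear scan: adjacent slots usually share a cache line,
        # so each extra probe is little more than an increment.
        for j in range(LINEAR_PROBES):
            idx = (i + j) & mask
            if table[idx] is None or table[idx] == key:
                return idx
        # Long collision runs fall back to a pseudo-random jump.
        perturb >>= 5
        i = (i * 5 + perturb + 1) & mask

def add(table, key):
    table[find_slot(table, key)] = key

def contains(table, key):
    return table[find_slot(table, key)] == key
```

The point of the inner loop is that a miss on the first slot costs almost nothing: the next candidates are already in cache.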
* The technique for compaction involves making the key/hash entry array dense
and augmenting it with a sparse array of indices. This necessarily involves
adding a layer of indirection for every probe.
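Concretely, the compact scheme looks something like this toy Python model (the names and details are mine, not the actual patch; resizing and deletion are omitted to keep it short). Note the extra hop through `indices` on every probe:

```python
EMPTY = -1  # sentinel for an unused slot in the sparse index array

class CompactSet:
    """Toy model of the compact layout: dense entries + sparse indices."""

    def __init__(self, size=8):
        self.indices = [EMPTY] * size  # sparse: positions into `entries`
        self.entries = []              # dense: (hash, key) in insertion order

    def _slot(self, key):
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while True:
            pos = self.indices[i]
            # Each probe costs an indirection: indices -> entries -> key.
            if pos == EMPTY or self.entries[pos][1] == key:
                return i
            i = (i + 1) & mask

    def add(self, key):
        i = self._slot(key)
        if self.indices[i] == EMPTY:
            self.indices[i] = len(self.entries)
            self.entries.append((hash(key), key))

    def __contains__(self, key):
        pos = self.indices[self._slot(key)]
        return pos != EMPTY and self.entries[pos][1] == key
```

Iterating `entries` gives insertion order for free, but every membership test now touches two arrays instead of one.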
* With the cache misses, branching costs, and extra layer of indirection,
collisions would stop being cheap, so we would need to work to avoid them
altogether. To get anything like the current performance when the first probe
collides, I suspect we would have to lower the table density from 60% to 25%.
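Back-of-the-envelope, using Knuth's textbook approximation for an unsuccessful search under plain linear probing (not CPython's exact probe sequence, so treat the numbers as indicative only):

```python
def expected_probes_miss(a):
    """Knuth's approximation for expected probes in an unsuccessful
    linear-probing search at load factor `a`: (1 + 1/(1-a)^2) / 2."""
    return 0.5 * (1 + 1 / (1 - a) ** 2)

print(round(expected_probes_miss(0.60), 2))  # ~3.62 at the current density
print(round(expected_probes_miss(0.25), 2))  # ~1.39 at a hypothetical 25%
```

Dropping the density that far would mean carrying a much larger sparse index array just to keep collisions rare enough.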
* The intersection operation has an important optimization where it loops over
the smaller of its two inputs. To give a guaranteed order that preserves the
order of the first input, you would have to forgo this optimization, possibly
crippling any existing code that depends on it.
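The optimization in question is essentially this (a sketch, not the C code):

```python
def intersection(a, b):
    """Sketch of set intersection: loop over the smaller operand and
    probe the larger, so the cost scales with the smaller input."""
    if len(b) < len(a):
        a, b = b, a  # swap so `a` is the smaller operand
    return {x for x in a if x in b}
```

Any rule such as "the result preserves the order of the first operand" forbids that swap whenever the first operand happens to be the larger one.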
* Maintaining order in the face of deletions adds a new workload to sets that
didn't exist before. You risk thrashing the set to support a feature that
hasn't been asked for and that isn't warranted mathematically (the notion of a
set is unordered).
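To see why, consider the dense entries array from the compact layout: an O(1) delete has to leave a hole, and keeping iteration cheap then requires periodic compaction (a sketch; the sentinel name is mine):

```python
DELETED = object()              # tombstone sentinel (illustrative)

entries = ["a", "b", "c", "d"]  # dense array in insertion order
entries[1] = DELETED            # O(1) removal of "b" leaves a hole

# Iteration must now skip tombstones...
assert [e for e in entries if e is not DELETED] == ["a", "c", "d"]

# ...and after enough deletions the array needs an O(n) rebuild to
# reclaim space -- maintenance work that today's sets never do.
entries = [e for e in entries if e is not DELETED]
```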
* It takes a lot of care and planning to avoid fooling yourself with benchmarks
on sets. Anything done with a small tight loop will tend to hide all branch
prediction costs and cache miss costs, both of which are significant in real
world uses of sets.
* For sets, we care much more about look-up performance than space. And unlike
dicts where we usually expect to find a key, sets are all about checking
membership which means they have to be balanced for the case where the key is
not in the set.
* Having and preserving order is one of the least important things a set can
offer (it does have some value, but it is the least important feature, one that
was never contemplated by the original set PEP).
After the success of the compact dict, I can understand an almost irresistible
urge to apply the same technique to sets. If it were clear that it was a win, I
would have already done it long ago, even before dicts (it was much harder to
get buy-in for changing the dicts). Please temper the enthusiasm with
rationality and caution. The existing setobject code has been finely tuned and
micro-optimized over the years, giving it excellent performance on workloads we
care about. It would be easy to throw all of that away.
Raymond
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev