[Tim]
> I know the theoretical number of probes for dicts, but not for sets
> anymore.  The latter use a mix of probe strategies now, "randomish"
> jumps (same as for dicts) but also purely linear ("up by 1") probing
> to try to exploit L1 cache.
>
> It's not _apparent_ to me that the mix actually helps ;-)  I just
> trust that Raymond timed stuff until he was sure it does.

To illustrate, I contrived a case where the set optimizations clearly
pay off.  Output (3.8.1, Win 10 Pro, quad core box with plenty of
RAM):

11184810 nice keys
dict build            0.54
dict iter             0.07
dict lookup           0.81
set build             0.54
set iter              0.08
set lookup            0.82

11184810 nasty keys
dict build           23.32
dict iter             0.09
dict lookup          13.26
set build             9.79
set iter              0.28
set lookup            8.46

I didn't make heroic efforts to keep the machine quiet, and there's
substantial variation across runs.  For example, there's no real
difference between "0.07" and "0.08".

Some things to note:

- The "nice" keys act much the same regardless of whether dict or set.
Dicts have an advantage for iteration "in theory" in the absence of
deletions because there are no "holes" in the area storing the
hash+key+value records.  But for these specific keys, the same is true
of sets:  the initial hash probes are at indices 0, 1, 2, ...,
len(the_set)-1, and there are no holes in its hash+key records either
(all the empty slots are together "at the right end").
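For concreteness, a tiny illustration of why the nice keys behave this way, assuming CPython's identity hash for small ints and the usual `index = hash & mask` trick for power-of-2 tables:

```python
# CPython hashes small ints to themselves, so with a power-of-2 table
# the initial probe index is just the key's low bits.
mask = 2**24 - 1          # a table with 2**24 slots
keys = range(10)
assert all(hash(k) == k for k in keys)
# initial probe indices are 0, 1, 2, ... -- consecutive, no collisions
assert [hash(k) & mask for k in keys] == list(range(10))
```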

- Sets get a lot of value out of their fancier probe sequence for the
nasty keys.  There are lots of collisions, and sets can much more
often resolve them from L1 cache.  Better than a factor of 2 for build
time is nothing to sneeze at.
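For anyone who wants to play with this, here's a rough Python sketch of the kind of probe sequence I mean, based on my reading of CPython's setobject.c (the constant and details are from memory, so treat it as illustrative, not gospel):

```python
LINEAR_PROBES = 9  # the value I recall from CPython's setobject.c

def set_probe_indices(h, mask, count):
    """Return the first `count` table indices probed for hash `h` in a
    table with `mask + 1` (a power of 2) slots.  A sketch, not the C code."""
    out = []
    perturb = h
    j = h & mask
    while len(out) < count:
        out.append(j)
        # purely linear ("up by 1") probes next, hoping to stay inside
        # the same L1 cache line(s) as slot j
        for i in range(1, LINEAR_PROBES + 1):
            if len(out) >= count:
                break
            out.append((j + i) & mask)
        # then a "randomish" jump, the same flavor dicts use
        perturb >>= 5
        j = (5 * j + 1 + perturb) & mask
    return out
```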

- Iteration for sets is substantially slower in this case, for two
reasons:  (1) sets _do_ have holes for this collection of keys; and,
(2) while I don't think it's documented, the maximum load factor for
sets is lower than for dicts, and 11184810 was picked to give the
maximum load factor for a dict with 2**24 slots.  The set in this case
has twice as many slots as the dict, and all the extras are empty
slots that iteration has to skip over.
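The arithmetic, assuming the dict resizes when it's more than 2/3 full and the set when it's more than 3/5 full (the 3/5 is my recollection of a CPython detail, so verify before relying on it):

```python
targetlen = 2**24 * 2 // 3          # the key count used above
assert targetlen == 11184810

dict_slots = 2**24
# exactly at the dict's 2/3 ceiling: 2**24 slots suffice
assert targetlen <= dict_slots * 2 // 3

# but past the set's (assumed) 3/5 ceiling for 2**24 slots,
# so the set grows to 2**25 -- twice as many slots to iterate over
assert targetlen > dict_slots * 3 // 5
assert targetlen <= 2**25 * 3 // 5
```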

- I don't have a theory for why dict build time is _so_ much higher
than dict lookup time for the nasty keys.

- If you don't know a lot about how these things are implemented, just
about everything is baffling ;-)  Integer keys are important in
practice, but things about how modern dicts are implemented were
_driven_ by opportunities to exploit properties of the mostly-trivial
int hash function.  For "general" timing, it's better to use string
keys, whose hash codes are "pretty randomish" (but can vary across
runs).  When using int keys, it's really easy to _end up_ measuring
happy behaviors _specific_ to int keys.
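To see how trivial the int hash is, and how the nasty keys exploit it (`sys.hash_info.modulus` is 2**61 - 1 on 64-bit builds):

```python
import sys

M = sys.hash_info.modulus
# for positive ints, hash(n) == n % M, so keys differing by a
# multiple of M are distinct objects with identical hash codes --
# exactly how `badkeys` is constructed below
collisions = [1 + i * M for i in range(5)]
assert len(set(collisions)) == 5               # five distinct keys
assert all(hash(k) == 1 for k in collisions)   # one shared hash code
```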

Code:

    from time import perf_counter as now
    import sys
    from collections import deque

    def disp(text, f):
        print(f"{text:20s} {f:5.2f}")

    M = sys.hash_info.modulus
    targetlen = 2**24 * 2 // 3 # highest load factor for dict
    hc = 1
    badkeys = []
    while len(badkeys) < targetlen:
        # add no more than 100 ints with hash code `hc`
        toadd = min(100, targetlen - len(badkeys))
        badkeys.extend(hc + i * M for i in range(toadd))
        hc += 713 # more than less arbitrary
    goodkeys = list(range(targetlen))

    for tag, keys in ("nice", goodkeys), ("nasty", badkeys):
        print()
        print(len(keys), tag, "keys")

        for thetype, builder in ("dict", dict.fromkeys), ("set", set):
            s = now()
            d = builder(keys)
            f = now()
            disp(thetype + " build", f-s)

            s = now()
            deque(d, maxlen=0)
            f = now()
            disp(thetype + " iter", f-s)

            s = now()
            for k in keys:
                assert k in d
            f = now()
            disp(thetype + " lookup", f-s)

            del d
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/S3UB5TCSUBLYSESG7WO2HNCFBZB5N4BG/
Code of Conduct: http://python.org/psf/codeofconduct/