geremy condra wrote:
> On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky
> <alexander.belopol...@gmail.com> wrote:
>> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser <ch...@subtlety.com>
>> wrote:
>>> .. and "x = iter(s).next()" raises a StopIteration
>>> exception.
>> And that's why the documented recipe should probably recommend
>> next(iter(s), default) instead. Especially because iter(s).next() is
>> not even valid code in 3.0.
>
> This seems reasonably legible to you? Strikes me as coding by
> incantation. Also, while I've heard people say that the naive
> approach is slower, I'm not getting that result. Here's my test:
>
>>>> smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
>>>> smrt.repeat(10)
> [1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
> 0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
> 0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
> 0.56846404075622559]
>
>>>> naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
>>>> naive.repeat(10)
> [0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
> 0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
> 0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
> 0.53491711616516113]
>
> Perhaps my test is flawed in some way?
>
> Geremy Condra
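For anyone following along, the recipe being benchmarked is just "grab an
arbitrary element from a set without removing it". A minimal sketch of the
variants under discussion (the helper names are only illustrative, not
anything from the stdlib):

    def first(s, default=None):
        # Preferred spelling on 2.6+/3.x: next() takes an optional default,
        # so an empty set returns 'default' instead of raising StopIteration.
        return next(iter(s), default)

    def first_pop_add(s):
        # The "naive" variant: remove an element and put it back.
        # Briefly mutates the set and raises KeyError on an empty set.
        x = s.pop()
        s.add(x)
        return x

    def first_loop(s):
        # Function-style equivalent of "for x in s: break": bind the loop
        # variable and stop immediately; returns None on an empty set.
        for x in s:
            return x
        return None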
Well, it also uses a fairly trivial set. 'set(range(100))' is going to
generally have 0 collisions and everything will hash to a unique bucket. As
such, pop()ing an item out of the hash is a single "val = table[int & mask];
table[int & mask] = _dummy", and then adding it back requires 2 table lookups
(one finds the _dummy, the next finds that the chain is broken and can
rewrite the _dummy).

However, if a set is more full, or has more collisions, or ... then pop() and
add() become relatively more expensive.

Surprising to me is that "next(iter(s))" was actually slower than
.pop() + .add() for a 100-node set in my testing:

$ alias TIMEIT="python -m timeit -s 's = set(range(100))'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.263 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:

$ TIMEIT "for x in s: break"
10000000 loops, best of 3: 0.0943 usec per loop

Now, what about a set with *lots* of collisions? Create 100 keys that all
hash to the same bucket:

$ alias TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in range(100)])'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.257 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

$ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
1000 loops, best of 3: 255 usec per loop
                       ^^^

Now something seems terribly wrong here. next(iter(s)) suddenly jumps from
being < 0.3 usec to being more than 200 usec, or ~1000x slower.

I realize the above has 900k keys, which is big. But 'next(iter(s))' should
only touch one and, in fact, should always return the *first* entry. My best
guess is that the *first* entry in the internal set table is no longer close
to offset 0, which means that 'next(iter(s))' has to scan a bunch of table
slots before it finds a non-empty entry.

Anyway, none of the proposals will really ever be faster than:

    for x in s: break

It is a bit of an ugly construct, but you don't have an attribute lookup,
etc. As long as you don't do:

    for x in s: pass

then it stays nice and fast.

John =:->
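A self-contained way to reproduce the small-vs-large comparison above (a
sketch only; where entries land in the internal table is a CPython
implementation detail, so the exact slow-down will vary by build and
version):

    import timeit

    # Small contiguous set: the first occupied slot tends to sit near
    # offset 0 of the internal table.
    small = "s = set(range(100))"
    # Large shifted range: per the guess above, the first occupied slot
    # may end up far from offset 0, so next(iter(s)) has to skip many
    # empty slots before finding an entry.
    large = "s = set(range(100000, 1000100))"

    for label, setup in [("small", small), ("large", large)]:
        t = timeit.timeit("next(iter(s))", setup=setup, number=100000)
        print("%s: %.3f usec per loop" % (label, t / 100000 * 1e6))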