Raymond Hettinger added the comment:

> How this change affects this case?

I would expect to see a tiny net benefit.  The set.add() would be 
microscopically faster (because it no longer has to find a free slot and 
jump back to it).  The set.discard() would do the same work (locating a 
value and marking it as a dummy).  The resize step would run at the same 
speed (it just skips over dummies and reinserts only the active values).  
The resize would run a little earlier (because we're no longer reusing a 
small proportion of the dummy entries), but it would clean out 100% of the 
dummies, making the table more sparse sooner.
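
For concreteness, a rough micro-benchmark of the mixed add/discard 
workload in question might look like this (the sizes and counts are 
invented for the sketch; this is not a measurement from the patch):

    import timeit

    def churn():
        # Interleave set.add() and set.discard() so the table keeps
        # creating dummy entries and periodically resizing.
        s = set(range(1000))
        for i in range(1000, 2000):
            s.add(i)
            s.discard(i - 1000)

    print(timeit.timeit(churn, number=10000))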

> Sets often are used in following pattern

Not really.  This is one use pattern out of many and is one of the least 
common.  The two dominant use cases for sets are uniquification (deduping) 
and fast membership testing.  The third most common case is data analysis 
using fast set-to-set operations (union, intersection, and difference).  
Then come cases with an individual membership test followed by an 
individual set.add() (i.e. the "if x not in seen: seen.add(x); work_on(x)" 
case).  Dead last are the affected cases that bounce around with a mix of 
set.add() and set.discard() or set.remove().  Each pattern is sketched 
below.
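
Toy sketches of those patterns, most common first (all names are invented 
for illustration):

    words = ["a", "b", "a", "c"]
    unique_words = set(words)        # 1) uniquification (deduping)

    allowed = {"read", "write"}
    ok = "read" in allowed           # 2) fast membership testing

    readers = {"ann", "bob"}
    writers = {"bob", "cid"}
    both = readers & writers         # 3) fast set-to-set operations

    seen = set()
    for x in words:                  # 4) membership test, then add
        if x not in seen:
            seen.add(x)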

The comment in dictobject.c suggests that the latter case is uncommon by a 
factor of hundreds.  Hence, I think work should be taken out of the inner loop 
for the common cases and instead deferred to the high-speed resize step which 
cleans out 100% of the dummy entries all at once.

The common cases all benefit (none of those have dummies, so there is no reason 
at all to check for dummies in set_add_entry).   The uncommon case (the mix of 
individual adds and discards) is about neutral or slightly positive (as 
explained above).  

I've tried to think of a case that would be negatively affected, and all I 
can think of is repeatedly adding and removing exactly the *same* element 
or small group of elements.  In that case, the freeslot would be reused 
100% of the time and the set would never need a resize.  I've not seen 
such a case, and even if I had, I would still care more about the common 
cases.
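
A sketch of that pathological pattern (the numbers are made up):

    s = set(range(8))
    for _ in range(1000000):
        # The same slot is vacated and refilled over and over; only an
        # early freeslot check would let it be reused immediately.
        s.add(100)
        s.discard(100)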

Also, I like the simplification of set_add_entry() and the separation of 
concerns (dummy reclamation occurs in exactly one place and that one place 
eliminates 100% of the dummies in a single pass).
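
To make that separation concrete, here is a toy Python model of the idea 
(an illustrative sketch only, not CPython's setobject.c; the ToySet 
class, its growth factor, and the plain linear probe are all invented for 
the example):

    EMPTY = object()   # never-used slot
    DUMMY = object()   # slot vacated by a deletion

    class ToySet:
        def __init__(self):
            self.table = [EMPTY] * 8
            self.fill = 0                        # active + dummy slots

        def _slots(self, key):
            i = hash(key) % len(self.table)
            while True:
                yield i
                i = (i + 1) % len(self.table)    # simple linear probing

        def add(self, key):
            for i in self._slots(key):
                slot = self.table[i]
                if slot is EMPTY:                # no freeslot bookkeeping:
                    self.table[i] = key          # dummies are skipped over,
                    self.fill += 1               # never reused here
                    if self.fill * 3 >= len(self.table) * 2:
                        self._resize()
                    return
                if slot is not DUMMY and slot == key:
                    return                       # already present

        def discard(self, key):
            for i in self._slots(key):
                slot = self.table[i]
                if slot is EMPTY:
                    return                       # not present
                if slot is not DUMMY and slot == key:
                    self.table[i] = DUMMY        # mark; reclaim at resize
                    return

        def _resize(self):
            old, self.table = self.table, [EMPTY] * (len(self.table) * 4)
            self.fill = 0
            for key in old:
                if key is EMPTY or key is DUMMY:
                    continue                     # 100% of dummies die here
                for i in self._slots(key):
                    if self.table[i] is EMPTY:   # reinsert active keys only
                        self.table[i] = key
                        self.fill += 1
                        break

The structural point is that add() and discard() never touch freeslot 
logic; dummy reclamation lives in exactly one place.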

FWIW, there is a loose analogy between this idea and the trade-off between 
reference counting and GC.  Reference counting reuses memory quicker than 
waiting for GC to reclaim memory in one pass later, but it entails encumbering 
all of the setting and deleting code.  GC-only systems make the rest of the 
code much cleaner and faster, but they have to wait to reclaim memory all at 
once.  Where the analogy fails, though, is that the reuse of dummies in 
sets is by far the least common case, that early freeslot checks only 
recapture a small fraction of the deletions (only the ones that happen to 
land in exactly the same hash slot), and that early freeslot checks are 
completely wasted in all of the common cases (which typically have no 
dummies at all).

----------
nosy: +tim.peters

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29476>
_______________________________________