Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Here's the results of the research so far:

Guido gave it a -1:   
http://mail.python.org/pipermail/python-ideas/2012-March/014510.html

Smalltalk does not return a boolean:    
http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#Set
http://www.gnu.org/software/smalltalk/manual-base/gst-base.html#HashedCollection
    add: newObject
    Add newObject to the set, if and only if the set doesn't already contain an 
occurrence of it. Don't fail if a duplicate is found. Answer anObject

Java does return a boolean for Interface Set<E>>: 
http://docs.oracle.com/javase/6/docs/api/
    boolean     add(E e)  Adds the specified element to this set if it is not 
already present (optional operation)

The SETL programming language does not return a boolean for the "set plus one 
element operation".  Instead, it returns the new set.
http://setl.org/setl/doc/setl-lib.html#with

Real-world examples of code using set.add in a way that benefits from the 
boolean result do not look promising.  They match the pattern in the minimal 
example shown in the previous post, "if e in previously_seen.add(e): ..."   
This is problematic because it can easily be read as "take this branch if e has 
been previously seen".

I consulted with the PyPy developers and found that PyPy is already eliminating 
double lookups in common cases (it recognizes that the two lookups refer to the 
same entry and constant folds them).  It cannot do this in the general case 
because of a potential semantic change (i.e. __eq__ can change the set during 
the initial lookup).

Looking at Jython, it appears that both sets and dicts are based on Java's 
concurrent mapping object which doesn't support a boolean result directly but 
allows the JIT to optimize away the double lookup depending on the JIT and on 
whether the keys are of all the same type.

One other note, in general the Python language makes no guarantees about 
atomicity (different implementations are free to make different implementation 
choices).  For example, people relying on dict.setdefault() to be atomic were 
making an incorrect assumption (until recently).  Our standing recommendation 
is to use locks unless your willing to rely on an implementation detail.  In 
the case of set.add(), the race condition is harmless since set.add() is 
idempotent.

I've shown sample code to some other developers and they had mixed guesses 
about the polarity of s.add(e), whether it would return True or False if the e 
was already in the set.  A mixed result is not good and implies that the 
proposal is error prone.  

Also, when shown the dedup() code example, the devs didn't feel that the 
additional conciseness was worth it (one said, "it just looks funny and causes 
me to do a double-take, but the original code could be read effortlessly").

My timings on the dedup() code showed a <2% speedup on an iterable of strings 
with 80% duplicates and no improvement on an iterable of strings with 5% 
duplicates.  The improvement was mostly erased if a "seen_add = seen.add" bound 
method was used inside the loop.  This indicates that the double lookup is 
cheap compared to the cost of the dotted lookup in "seen.add(e)".  That being 
said, objects with a custom __hash__ would benefit from calling __hash__ only 
once.

Given that the results above were uniformly negative (except for the Java 
library), I'll skip trying this out on my students and am rejecting the 
suggested change to the pure Python API.  We appreciate your suggestion but it 
isn't appropriate for inclusion in the language.

Some of those concerns may also apply to the suggested change to the C API

----------
resolution:  -> rejected
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14320>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to