Hi Jens,
This is a situation where intuition can easily lead one astray. Indeed, I
initially made the same assumption [1] that x.removeAll(y) should essentially
call x.remove() on each element of y, as if y::forEach(x::remove). This would
imply that y is iterated and that x's membership semantics would be used.
However, after some further discussion, research, and analysis, I changed my
mind. [2] This is somewhat counterintuitive, but there are good reasons for
going the opposite direction.
First, consider the specification of Collection.removeAll:
Removes all of this collection's elements that are also contained in
the specified collection[...].
This is somewhat oddly worded, even stilted. Why doesn't it say something simple
like "removes each of the elements of the specified collection from this
collection"?
The answer is that they mean different things. Consider:
var list = new ArrayList<>(List.of("a", "a"))
list.removeAll(List.of("a"))
list ==> []
The word "all" is used in two senses simultaneously here: it removes from this
collection "all" occurrences of "all" the elements of the specified collection.
The implication here is that *this* collection is iterated, and an element is
removed if it's contained in the specified collection. That implies contains()
is called on the specified collection, and that uses the specified collection's
membership semantics. AbstractCollection.removeAll is implemented exactly this way.
Given that Set.removeAll overrides Collection.removeAll, they should have
compatible semantics to the greatest extent possible. (And yes, there are
various existing violations of Liskov subsitutability in the collections
framework, but changing this would introduce a new violation as well as a
massive incompatibility.)
You might say, the example above has duplicates in *this* collection, which
turns out to be a list. But this is different, because now we're talking about
sets, which can't have duplicates, so the alternative sense of removeAll doesn't
matter.
It still matters, because the notion of "duplicate" depends on whose membership
semantics are in use. Consider:
Collection<String> list = new ArrayList<>(List.of("a", "A"))
Collection<String> hash = new HashSet<>(list)
var tree = new TreeSet<>(String.CASE_INSENSITIVE_ORDER)
tree.add("A")
list.removeAll(tree)
list ==> []
hash.removeAll(tree)
hash ==> []
Even though neither the List nor the HashSet appears to have duplicates, they
both do when using the TreeSet's semantics. It would be terribly inconsistent to
have AbstractCollection.removeAll and AbstractSet.removeAll differ in which
collection's semantics they depend upon.
Finally, there is retainAll. It may be that removeAll is used more than
retainAll, but their specifications and semantics are inextricably linked. Their
specs are:
removeAll: Removes all of this collection's elements that are also
contained in the specified collection
retainAll: Retains only the elements in this collection that are
contained in the specified collection
The specifications are inverses of each other, so the result of removeAll should
be the complement of the result of retainAll, and both rely on the contains()
semantics of the specified collection.
**
It's perhaps unfortunate that Set.removeAll seems "backwards", but there are
good reasons it's this way. It needs to stay this way in order to remain
compatible with Collections.removeAll and to preserve its relationship with
retainAll.
s'marks
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html
[2]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058540.html
On 5/2/20 11:26 AM, Jens Lideström wrote:
I will chip in once more on this topic. (I did so once before [1], perhaps
in a much too wordy manner.)
The proposed changes in AbstractSet.removeAll will be inherited by
TreeSet. A TreeSet might use a comparator as a custom equality test.
Because of this it seems backwards if TreeSet.removeAll where to use the
equality semantics of the argument instead of the ones of the receiver,
which will be the case with the proposed changes. It seems much more
intuitive and regular to instead use the equality semantics of the receiver.
The same thing applies to other collections with custom equality tests.
It seems to be surprising and error prone to users if TreeSet where to
behave like this:
var s = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
s.addAll(Set.of("A", "B"));
s.remove("a"); // Removes "A" as expected
s.removeAll(List.of("b")); // Ops, doesn't remove "B"!
Many other common methods use the equality semantics of the receiver:
containsAll, equals, remove, contains. To let TreeSet.removeAll be
inconsistent with all these methods seems worse than it being inconsistent
with the less commonly used TreeSet.retainAll.
The conclusion is that I think it would be better if TreeSet or
AbstractSet gets an implementation of removeAll which iterates over the
argument collection.
Best regards,
Jens Lideström
[1]:
https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061581.html
On 2020-05-01 22:01, Stuart Marks wrote:
Hi all,
I'm taking another swing at this one. I've taken a minimal approach
compared to my previous proposals.
Briefly, AbstractSet.removeAll switches from iterating this collection and
calling contains() on its argument, to iterating the argument and calling
this.contains(), if the argument collection is smaller than this
collection. This attempt at optimization is incorrect, because this
collection and the argument collection might have different semantics for
contains().
There is a confounding performance problem, which is that if the argument
is a List, contains() is generally linear, which can result in
pathologically slow (O(m*n)) performance. Because of the
iteration-switching above, this performance problem can appear and
disappear based on the relative sizes, which is startling.
The fix for the semantic problem is simply to remove the attempted
optimization from AbstractSet. This comports with the specification of
Collection.removeAll and Set.removeAll, which imply that contains() is
called on the argument collection. This allows sets to inherit the
implementation in AbstractCollection.removeAll. In addition, this ensures
that removeAll is now always the complement of retainAll (as it should
be), which it sometimes was not when the optimization was in place.
IdentityHashMap's keySet and entrySet views were broken by this
optimization, so they had to override removeAll with copies of the
implementation from AbstractCollection. Since they can now inherit from
AbstractCollection, these overrides have been removed.
This leaves a potential performance problem with removeAll when the
argument is a List. To mitigate this, HashSet.removeAll switches to
iterating the argument if the argument is a List. This is safe, as both
HashSet and List use the same semantics for contains() (at least, as far
as I know).
I've opted not to pursue size-based optimizations at this time, since they
provide only incremental benefit, whereas the pathological performance
problem mentioned above is the primary issue. Also, it's actually quite
difficult to determine when it's safe to switch iteration.
Finally, I've added performance notes to the specifications of
containsAll(), removeAll(), and retainAll(), warning about potential
performance problems that can occur with repeated calls to contains() made
by these methods.
Bug:
https://bugs.openjdk.java.net/browse/JDK-6394757
Webrev:
http://cr.openjdk.java.net/~smarks/reviews/6394757/webrev.2/
Previous discussions:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058140.html
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058378.html
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060007.html
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060147.html
Thanks,
s'marks