Hi Jens,

This is a situation where intuition can easily lead one astray. Indeed, I initially made the same assumption [1] that x.removeAll(y) should essentially call x.remove() on each element of y, as if y::forEach(x::remove). This would imply that y is iterated and that x's membership semantics would be used. However, after some further discussion, research, and analysis, I changed my mind. [2] This is somewhat counterintuitive, but there are good reasons for going the opposite direction.

First, consider the specification of Collection.removeAll:

    Removes all of this collection's elements that are also contained in
    the specified collection[...].

This is somewhat oddly worded, even stilted. Why doesn't it say something simple like "removes each of the elements of the specified collection from this collection"?

The answer is that they mean different things. Consider:

    var list = new ArrayList<>(List.of("a", "a"))
    list.removeAll(List.of("a"))
    list ==> []

The word "all" is used in two senses simultaneously here: it removes from this collection "all" occurrences of "all" the elements of the specified collection.

The implication here is that *this* collection is iterated, and an element is removed if it's contained in the specified collection. That implies contains() is called on the specified collection, and that uses the specified collection's membership semantics. AbstractCollection.removeAll is implemented exactly this way.

Given that Set.removeAll overrides Collection.removeAll, they should have compatible semantics to the greatest extent possible. (And yes, there are various existing violations of Liskov subsitutability in the collections framework, but changing this would introduce a new violation as well as a massive incompatibility.)

You might say, the example above has duplicates in *this* collection, which turns out to be a list. But this is different, because now we're talking about sets, which can't have duplicates, so the alternative sense of removeAll doesn't matter.

It still matters, because the notion of "duplicate" depends on whose membership semantics are in use. Consider:

    Collection<String> list = new ArrayList<>(List.of("a", "A"))
    Collection<String> hash = new HashSet<>(list)
    var tree = new TreeSet<>(String.CASE_INSENSITIVE_ORDER)
    tree.add("A")
    list.removeAll(tree)
    list ==> []
    hash.removeAll(tree)
    hash ==> []

Even though neither the List nor the HashSet appears to have duplicates, they both do when using the TreeSet's semantics. It would be terribly inconsistent to have AbstractCollection.removeAll and AbstractSet.removeAll differ in which collection's semantics they depend upon.

Finally, there is retainAll. It may be that removeAll is used more than retainAll, but their specifications and semantics are inextricably linked. Their specs are:

    removeAll: Removes all of this collection's elements that are also
    contained in the specified collection

    retainAll: Retains only the elements in this collection that are
    contained in the specified collection

The specifications are inverses of each other, so the result of removeAll should be the complement of the result of retainAll, and both rely on the contains() semantics of the specified collection.

**

It's perhaps unfortunate that Set.removeAll seems "backwards", but there are good reasons it's this way. It needs to stay this way in order to remain compatible with Collections.removeAll and to preserve its relationship with retainAll.

s'marks


[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html

[2] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058540.html




On 5/2/20 11:26 AM, Jens Lideström wrote:
I will chip in once more on this topic. (I did so once before [1], perhaps
in a much too wordy manner.)

The proposed changes in AbstractSet.removeAll will be inherited by
TreeSet. A TreeSet might use a comparator as a custom equality test.

Because of this it seems backwards if TreeSet.removeAll where to use the
equality semantics of the argument instead of the ones of the receiver,
which will be the case with the proposed changes. It seems much more
intuitive and regular to instead use the equality semantics of the receiver.

The same thing applies to other collections with custom equality tests.

It seems to be surprising and error prone to users if TreeSet where to
behave like this:

var s = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
s.addAll(Set.of("A", "B"));
s.remove("a"); // Removes "A" as expected
s.removeAll(List.of("b")); // Ops, doesn't remove "B"!

Many other common methods use the equality semantics of the receiver:
containsAll, equals, remove, contains. To let TreeSet.removeAll be
inconsistent with all these methods seems worse than it being inconsistent
with the less commonly used TreeSet.retainAll.

The conclusion is that I think it would be better if TreeSet or
AbstractSet gets an implementation of removeAll which iterates over the
argument collection.

Best regards,
Jens Lideström

[1]:
https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061581.html

On 2020-05-01 22:01, Stuart Marks wrote:
Hi all,

I'm taking another swing at this one. I've taken a minimal approach
compared to my previous proposals.

Briefly, AbstractSet.removeAll switches from iterating this collection and
calling contains() on its argument, to iterating the argument and calling
this.contains(), if the argument collection is smaller than this
collection. This attempt at optimization is incorrect, because this
collection and the argument collection might have different semantics for
contains().

There is a confounding performance problem, which is that if the argument
is a List, contains() is generally linear, which can result in
pathologically slow (O(m*n)) performance. Because of the
iteration-switching above, this performance problem can appear and
disappear based on the relative sizes, which is startling.

The fix for the semantic problem is simply to remove the attempted
optimization from AbstractSet. This comports with the specification of
Collection.removeAll and Set.removeAll, which imply that contains() is
called on the argument collection. This allows sets to inherit the
implementation in AbstractCollection.removeAll. In addition, this ensures
that removeAll is now always the complement of retainAll (as it should
be), which it sometimes was not when the optimization was in place.

IdentityHashMap's keySet and entrySet views were broken by this
optimization, so they had to override removeAll with copies of the
implementation from AbstractCollection. Since they can now inherit from
AbstractCollection, these overrides have been removed.

This leaves a potential performance problem with removeAll when the
argument is a List. To mitigate this, HashSet.removeAll switches to
iterating the argument if the argument is a List. This is safe, as both
HashSet and List use the same semantics for contains() (at least, as far
as I know).

I've opted not to pursue size-based optimizations at this time, since they
provide only incremental benefit, whereas the pathological performance
problem mentioned above is the primary issue. Also, it's actually quite
difficult to determine when it's safe to switch iteration.

Finally, I've added performance notes to the specifications of
containsAll(), removeAll(), and retainAll(), warning about potential
performance problems that can occur with repeated calls to contains() made
by these methods.

Bug:

     https://bugs.openjdk.java.net/browse/JDK-6394757

Webrev:

     http://cr.openjdk.java.net/~smarks/reviews/6394757/webrev.2/

Previous discussions:

http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html

http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058140.html

http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058378.html


     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060007.html

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060147.html

Thanks,

s'marks

Reply via email to