Right, as I mentioned in my earlier emails [1][2] this is related to JDK-6394757
[3] where the semantics shift depending on the relative sizes of the
collections. This has a complicated history. In JDK-6982173 [4] there is a lot
of discussion about what heuristics to use for iterating this vs the arg, in
order to get the best performance. This assumes that these operations are
equivalent, when they aren't! In JDK-6394757 there's some discussion about the
dubious semantics of this optimization and the potential of simply removing it.
At this point I think we need to remove the performance heurstic in order to get
consistent semantics. The performance won't be optimal -- indeed, it wasn't
before -- but at least it should be more predictable.
Your example is indeed counterintuitive, as it uses the contains() semantics of
the argument -- the string length -- instead of the contains() semantics of
"this". However, it's defensible, and it's supported by the spec, so I don't
really see any alternative.
The example set is pretty counterintuitive in the first place. For example:
jshell> Set<String> set = new TreeSet<>(Comparator.comparingInt(String::length))
jshell> set.addAll(List.of("a", "if", "the", "when"))
jshell> set
set ==> [a, if, the, when]
jshell> set.contains("foo")
$235 ==> true
Also, this comparator isn't consistent with equals. This isn't a bug. But it
does give rise to even more counterintuitive behavior:
jshell> var set2 = Set.of("x", "xx", "xxx", "xxxx")
jshell> set2.equals(set)
$237 ==> false
jshell> set.equals(set2)
$238 ==> true
Fun with collections! :-)
s'marks
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html
[2]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058539.html
[3] https://bugs.openjdk.java.net/browse/JDK-6394757
[4] https://bugs.openjdk.java.net/browse/JDK-6982173
On 2/11/19 2:23 PM, Michael Rasmussen wrote:
The current implementation seems very counter-intuitive with anything that
doesn't have the same comparison semantics, for instance a TreeSet/Map with a
Comparator, Identity-based Set/Map etc.
The output of the following snippet would probably surprise most:
/* --- snip --- */
import java.util.*;
class Test {
public static void main(String[] args) {
TreeMap<String, List<String>> lengthMap = new
TreeMap<>(Comparator.comparingInt(String::length));
List<String> strings = new ArrayList<>(Arrays.asList("The quick brown fox jumps over the
lazy dog".split(" ")));
for (String s : strings) {
lengthMap.computeIfAbsent(s, k -> new ArrayList<>()).add(s);
}
Set<String> toRemove = lengthMap.keySet();
System.out.println("List before: " + strings);
System.out.println("Set to remove: " + toRemove);
strings.removeAll(toRemove);
System.out.println("List after:" + strings);
}
}
/* --- snip --- */
List before: [The, quick, brown, fox, jumps, over, the, lazy, dog]
Set to remove: [The, over, quick]
List after:[]
/Michael
From: core-libs-dev <core-libs-dev-boun...@openjdk.java.net> on behalf of Tagir
Valeev <amae...@gmail.com>
Sent: 08 February 2019 15:13
To: Stuart Marks
Cc: core-libs-dev; Jan Peter Stotz
Subject: Re: JDK-6982173: Small problem causing thousands of wasted CPU hours
Hello!
I would argue that iterating the argument and calling remove() on "this" are the
right semantics, because you want set membership to be determined by this set,
not by whatever collection you pass as an argument. However, I note that
AbstractCollection.removeAll and other removeAll implementations iterate over
"this" and do a contains() check on the argument. The revised
AbstractSet.removeAll would be an outlier here, though it makes sense to me to
do it this way.
For complete picture it should be noted that there's a slight
difference in remove and removeAll spec: remove removes at most one
element while removeAll removes all elements from the specified
collection.
E.g. c.removeAll(Collections.singleton(foo)) would remove all
instances of foo from c while c.remove(foo) would return only one foo.
These should be equivalent for Set where repeating elements should not
normally appear, but it's wrong for any Collection. That's why
AbstractCollection.removeAll
cannot be rewritten in the same way.
Another interesting thing is
java.util.IdentityHashMap.KeySet#removeAll implementation [1]:
it reimplements the AbstractCollection#removeAll, because of the same
reason: now
the first branch of AbstractSet#removeAll becomes incorrect. See the
comment before it:
/*
* Must revert from AbstractSet's impl to AbstractCollection's, as
* the former contains an optimization that results in incorrect
* behavior when c is a smaller "normal" (non-identity-based) Set.
*/
Btw this comment should be updated to remove a "smaller" condition if
the proposed
change for AbstractSet will be implemented.
[1]
http://hg.openjdk.java.net/jdk/jdk/file/e57bcfd7bf79/src/java.base/share/classes/java/util/IdentityHashMap.java#l990
With best regards,
Tagir Valeev
On Fri, Jan 25, 2019 at 7:11 AM Stuart Marks <stuart.ma...@oracle.com> wrote:
On 1/23/19 12:05 PM, Jan Peter Stotz wrote:
like many other I ran into bug JDK-698217 which is about AbstractSet.removeAll()
and it's "aptitude" in wasting CPU time if you accidentally hit this bug. And
there are hundred of developers hitting this bug every year.
I simply don't understand why there was no progress in 8 years, on a severe
performance issue (a removeAll method on an efficient set that can require
O(n^2)!) caused by code that was written to speed-up the removeAll
implementation.
Which makes this bug worse is that it is triggered by the relative size of the
current set compared to the collection passed as parameter.
Therefore for most developers this means not to use this buggy function at all
(once they realized how worse it is).
I wasn't aware that hundreds of developers are hitting this bug every year. I
haven't seen any mention of it (besides in the bug database) on Twitter, on
Reddit, on DZone, at the conferences I attend, or in several years of
core-libs-dev emails. Well, it was mentioned on core-libs-dev once in 2011 [1]
although that was a suggestion for improvement, not a complaint about
performance.
Nonetheless this is a real problem, and if it's causing difficulties we can
certainly take a look at it.
There is, however, more to the story. The ACTUAL problem is a semantic one; see
JDK-6394757. [2] Briefly, consider x.removeAll(y). Depending on the relative
sizes of x and y, this method might end up using either x's or y's definition of
membership, which could differ from each other. (See the bug report for an
example.) Thus the semantics of this method depend upon the relative sizes of
the collections, which is arguably flawed.
Worse, this behavior is specified to iterate this set or the argument, depending
upon their relative sizes. [3] So, fixing this will require an incompatible
specification change.
The obvious way to fix this is to get rid of the "optimizations" (that turn out
not to be optimizations at all in some cases) and replace it with a simple loop:
public boolean removeAll(Collection<?> c) {
Objects.requireNonNull(c);
boolean modified = false;
for (Object e : c)
modified |= remove(e);
return modified;
}
I would argue that iterating the argument and calling remove() on "this" are the
right semantics, because you want set membership to be determined by this set,
not by whatever collection you pass as an argument. However, I note that
AbstractCollection.removeAll and other removeAll implementations iterate over
"this" and do a contains() check on the argument. The revised
AbstractSet.removeAll would be an outlier here, though it makes sense to me to
do it this way.
Is it worth the incompatibility?
s'marks
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html
[2] https://bugs.openjdk.java.net/browse/JDK-6394757
[3]
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)