[
https://issues.apache.org/jira/browse/GROOVY-10964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702100#comment-17702100
]
Ingo Wilms edited comment on GROOVY-10964 at 3/18/23 11:04 AM:
---------------------------------------------------------------
[~paulk] thank you so much! It's really incredible how fast things are put in
motion here. That motivated me to look into the implementation of TreeSet and I
believe that is where the real culprit lies. While TreeSet overrides
{{{}remove(){}}}, it inherits {{removeAll()}} from AbstractSet. If I am not
mistaken this is the code:
{code:groovy}
public boolean removeAll(Collection<?> c) {
Objects.requireNonNull(c);
boolean modified = false;
Iterator i;
Object e;
if (this.size() > c.size()) {
for(i = c.iterator(); i.hasNext(); modified |= this.remove(e)) {
e = i.next();
}
} else {
i = this.iterator(); while(i.hasNext()) {
if (c.contains(i.next())) {
i.remove();
modified = true;
}
}
} return modified;
}{code}
While I understand it might be clever to iterate through less objects, the use
of {{contains()}} in the {{{}else{}}}-block leads to two problems:
# It is inefficient if {{this}} is a TreeSet and {{c}} is an ArrayList,
because {{contains()}} is cheap for the tree, but not for the list.
# Since {{contains()}} has a different meaning for TreeSet with custom
comparator than {{for}} ArrayList, it can lead to different results.
I guess this is one of the cases why the
[doc|https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/TreeSet.html]
for TreeSet recommends the {{comparator}} to be consistent with
{{{}equals(){}}}, but I was'nt aware that this also applies for the elements
you want to remove. To be fair, this behavoir for {{removeAll}} is well
[documented|https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)],
but I read that part for the first time.
Here is an example for different running times
{code:groovy}
int n = 100_000
// This is fast, because tree.size() > list.size() and tree.remove() is fast
List list1 = (1..<n).collect { it }
TreeSet tree1 = (1..n).collect {it}
tree1.removeAll(list1)
println("fast version finished")
// This is slow, because tree.size() <= list.size() and list.contains() is slow
List list2 = (1..n).collect { it }
TreeSet tree2 = (1..n).collect {it}
tree2.removeAll(list2)
println("slow version finished") {code}
And here is an example for different results
{code:groovy}
List a = [1, 2]
List b1 = [2.0]
List b2 = [2.0, 3]
def tree1 = new TreeSet(new NumberAwareComparator())
tree1.addAll(a)
tree1.removeAll(b1)
def tree2 = new TreeSet(new NumberAwareComparator())
tree2.addAll(a)
tree2.removeAll(b2)
assert (tree1 != tree2)
assert tree1.toSet() == [1] as Set
assert tree2.toSet() == [1, 2.0] as Set {code}
I guess I will use {{removeAll()}} with more caution in the future.
was (Author: JIRAUSER299120):
[~paulk] thank you so much! It's really incredible how fast things are put in
motion here. That motivated me to look into the implementation of TreeSet and I
believe that is where the real culprit lies. While TreeSet overrides
{{{}remove(){}}}, it inherits {{removeAll()}} from AbstractSet. If I am not
mistaken this is the code:
{code:groovy}
public boolean removeAll(Collection<?> c) {
Objects.requireNonNull(c);
boolean modified = false;
Iterator i;
Object e;
if (this.size() > c.size()) {
for(i = c.iterator(); i.hasNext(); modified |= this.remove(e)) {
e = i.next();
}
} else {
i = this.iterator(); while(i.hasNext()) {
if (c.contains(i.next())) {
i.remove();
modified = true;
}
}
} return modified;
}{code}
While I understand it might be clever to iterate through less objects, the use
of {{contains()}} in the {{{}else{}}}-block leads to two problems:
# It is inefficient if {{this}} is a TreeSet and {{c}} is an ArrayList,
because {{contains()}} is cheap for the tree, but not for the list.
# Since {{contains()}} has a different meaning for TreeSet with custom
comparator than {{for}} ArrayList, it can lead to different results.
I guess this is one of the cases why the
[doc|https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/TreeSet.html]
for TreeSet recommends the {{comparator}} to be consistent with
{{{}equals(){}}}, but I was'nt aware that this also applies for the elements
you want to remove. To be fair, this behavoir for {{removeAll}} is well
[documented|https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/AbstractSet.html#removeAll(java.util.Collection)],
but I read that part for the first time.
Here is an example for different running times
{code:groovy}
int n = 100_000
// This is fast, because tree.size() > list.size() and tree.remove() is fast
List list1 = (1..<n).collect { it }
TreeSet tree1 = (1..n).collect {it}
tree1.removeAll(list1)
println("fast version finished")
// This is slow, because tree.size() <= list.size() and list.contains() is slow
List list2 = (1..n).collect { it }
TreeSet tree2 = (1..n).collect {it}
tree2.removeAll(list2)
println("slow version finished") {code}
And here is an example for different results
{code:groovy}
List a = [2, 3.0, 4, 5, 6 ]
List b1 = [2.0, 2, 3, 7]
List b2 = [2.0, 2, 3, 7, 8]
def tree1 = new TreeSet(new NumberAwareComparator())
tree1.addAll(a)
tree1.removeAll(b1)
def tree2 = new TreeSet(new NumberAwareComparator())
tree2.addAll(a)
tree2.removeAll(b2)
assert (tree1 != tree2)
assert tree1.toSet() == [4, 5, 6] as Set
assert tree2.toSet() == [3.0, 4, 5, 6] as Set {code}
I guess I will use {{removeAll()}} with more caution in the future.
> List.minus() slow for Numbers
> -----------------------------
>
> Key: GROOVY-10964
> URL: https://issues.apache.org/jira/browse/GROOVY-10964
> Project: Groovy
> Issue Type: Question
> Components: groovy-jdk
> Affects Versions: 2.4.0, 3.0.9, 4.0.9
> Reporter: Ingo Wilms
> Priority: Minor
> Labels: Collections, Groovy, List, Number
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> In List.minus() is a n*LOG\(n) version for comparable objects. Only for
> numbers, there is a dedicated slower n^2*LOG\(n) version. Is there a reason
> for this? It exists since 2.4.0 and hasn't changed much since then. Here is
> part of the code from version 4.0.9:
>
> {code:java}
> // if (nlgnSort && (head instanceof Comparable)) {
> //n*LOG(n) version
> Set<T> answer;
> if (head instanceof Number) {
> answer = new TreeSet<>(comparator);
> answer.addAll(self1);
> for (T t : self1) {
> if (t instanceof Number) {
> for (Object t2 : removeMe1) {
> if (t2 instanceof Number) {
> if (comparator.compare(t, (T) t2) == 0)
> answer.remove(t);
> }
> }
> } else {
> if (removeMe1.contains(t))
> answer.remove(t);
> }
> }
> } else {
> answer = new TreeSet<>(comparator);
> answer.addAll(self1);
> answer.removeAll(removeMe1);
> }
> for (T o : self1) {
> if (answer.contains(o))
> ansCollection.add(o);
> }
> } else {
> //n*n version {code}
> I fail to see why the whole extra block for numbers beginning with
> {code:java}
> if (head instanceof Number) { {code}
> is necessary.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)