[
https://issues.apache.org/jira/browse/COLLECTIONS-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286597#comment-14286597
]
Oswaldo Olivo commented on COLLECTIONS-548:
-------------------------------------------
Here's another potential fix by using a more efficient retainAll method in the
spirit of CollectionUtils::containsAll
{code}
------------------------------------------------------------
// Eliminates from coll1 all elements that are not in coll2.
// It runs in O(m+n) size, requiring additional O(m) space.
public static boolean efficientRetainAll(final Collection<?> coll1,final
Collection<?> coll2) {
// If coll2 is empty there are no elements to retain.
if(coll2.isEmpty()) {
return coll1.removeAll(coll1);
}
// Simple case when we're supposed to retain all elements
// in the first collection.
if(coll1==coll2)
return false;
// it1 iterates over coll1 and it2 iterares over coll2,
// and seen contains all the elements before it2, allowing
// us to never revisit previous elements.
// The algorithm iterates through it1, checks to see if we've
// already seen the current element of it1 via a Hashset
// efficient check, or traverses elements of it2 until we find
// it or it2 ends. At each iteration over it2 we add the
// elements to seen to avoid revisiting items.
Iterator<?> it1 = coll1.iterator();
Iterator<?> it2 = coll2.iterator();
HashSet<Object> seen = new HashSet<Object>();
boolean changed=false;
// Traverse all the elements in coll1.
while(it1.hasNext()) {
final Object o=it1.next();
// If we've seen this element in coll2, keep it.
if(seen.contains(o))
continue;
// Otherwise, check for its containment in coll2, while
// adding the elements to seen.
boolean contained=false;
while(it2.hasNext()) {
final Object o2=it2.next();
seen.add(o2);
// Found the element in coll2.
if(o2.equals(o)) {
contained=true;
break;
}
}
// If the element was not found in coll2, remove it from it1.
if(!contained) {
changed=true;
it1.remove();
}
}
return changed;
}
{code}
----------------------
And the harness:
-----------------------
{code}
public static void compositeCollectionRetainAllTest(boolean original) {
int N=500000;
ArrayList<Integer> firstArrayList=new ArrayList<Integer>();
ArrayList<Integer> secondArrayList=new ArrayList<Integer>();
for(int i=0;i<N;++i) {
firstArrayList.add(new Integer(i));
secondArrayList.add(new Integer(N-1-i));
}
CompositeCollection col = new CompositeCollection(firstArrayList);
if(original)
col.retainAll(secondArrayList);
else
efficientRetainAll(col,secondArrayList);
}
{code}
------------------------------
The results are:
N Original(s) Repaired(s) Speed-up(X)
10 1.04 1.05 0.99
100 1.04 1.05 0.99
1000 1.06 1.06 1.00
10000 1.12 1.10 1.01
100000 9.34 1.15 8.12
500000 > 300 1.34 > 223.88
> Performance issue in CompositeCollection::retainAll
> ---------------------------------------------------
>
> Key: COLLECTIONS-548
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-548
> Project: Commons Collections
> Issue Type: Bug
> Components: Collection
> Affects Versions: 4.1
> Environment: Ubuntu 14.04
> Reporter: Oswaldo Olivo
> Priority: Minor
> Labels: performance
>
> There seems to be a performance problem with the function retainAll of
> CompositeCollection. This is analogous to
> https://issues.apache.org/jira/browse/COLLECTIONS-534 .
> The following is the code of the function:
> {code}
>
> /**
> * Retains all the elements in the specified collection in this composite
> collection,
> * removing all others.
> * <p>
> * This implementation calls <code>retainAll()</code> on each collection.
> *
> * @param coll the collection to remove
> * @return true if the collection was modified
> * @throws UnsupportedOperationException if retainAll is unsupported
> */
> public boolean retainAll(final Collection<?> coll) {
> boolean changed = false;
> for (final Collection<E> item : all) {
> changed |= item.retainAll(coll);
> }
> return changed;
> }
> {code}
> The performance problem occurs when the underlying collections in the current
> collection have a slow retainAll method. Whenever we're relying on
> Collection::retainAll, slow cases tend to occur when the parameter collection
> has a slow contains method.
> The following test harness shows the performance degradation between
> using a list and using a set as a parameter, across different collection
> sizes.
> {code}
> public static void compositeCollectionRetainAllTest(boolean original) {
> int N=500000;
> ArrayList<Integer> firstArrayList=new ArrayList<Integer>();
> ArrayList<Integer> secondArrayList=new ArrayList<Integer>();
> for(int i=0;i<N;++i) {
> firstArrayList.add(new Integer(i));
> secondArrayList.add(new Integer(N-1-i));
>
> }
> CompositeCollection col = new CompositeCollection(firstArrayList);
> col.retainAll(original ? secondArrayList : (new
> HashSet<Integer>(secondArrayList)));
>
> }
> {code}
> In the following table "Original" corresponds to the time taken by
> the existing implementation of CompositeCollection::retainAll, "Repaired" to
> the time taken by the function invoked with the parameter converted into a
> set, and "Speed-up" to the speed-up obtained after the repair.
> N Original(s) Repaired(s) Speed-up(X)
> 10 1.04 1.04 1.00
> 100 1.04 1.05 0.99
> 1000 1.06 1.06 1.00
> 10000 1.12 1.10 1.01
> 100000 9.34 1.15 8.12
> 500000 > 300 1.29 > 232.55
> If it's unacceptable to convert the parameter into a set before calling
> retainsAll, a solution would be to include a warning to the user in the
> documentation that the parameter should have a fast contains method when
> possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)