Hello!

Please consider the following real utility method I have:

    /**
     * Returns an immutable list via {@link List#of} that is the
     * sorted content of the collection.
     */
    public static <E> @NonNull List<E> sortedImmutableList(
            @NonNull Collection<E> col,
            @Nullable Comparator<? super E> comparator) {
        ArrayList<E> list = new ArrayList<>(col);
        list.sort(comparator);
        return List.copyOf(list);
    }

Quiz question: Assume this method is invoked with a large HashSet as
its collection argument. How many ARRAY copies of the collection are
created by this method? Clearly there must be at least one for the
result, held internally in the immutable list. Any others?


(Scroll for the answer.)

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

I was surprised to realize the answer is four:

1. ArrayList calls toArray() on the HashSet.
2. ArrayList defensively copies the array to assign to its internal
elementData field, because it doesn't trust HashSet.
3. List.copyOf calls toArray() on the ArrayList.
4. List.of defensively copies the array to assign to its internal
elements field, because it doesn't trust ArrayList.

It is quite possible to rewrite my particular utility method with a
raw array instead of ArrayList, to eliminate 2 array copies, but this
requires an icky unchecked cast, and by rights is the kind of thing
that shouldn't be necessary anyway.

-----

In general, this kind of excessive defensive copying is going on ALL THE TIME.

The underlying nuisance is that a subclass of Collection could
potentially override toArray() to keep a reference to the returned
array and make mischief with it. (Also, for ArrayList, it would be a
problem if the returned array were a subclass of Object[].)

I did a search thru the JDK and found a slew of cases where code calls
Collection.toArray() and then has to worry about defensively copying
it -- or trying to avoid defensively copying it:

- java.util.List.copyOf(Collection) -- avoids copying an existing
List.of/copyOf list, but for any other collection it calls toArray()
and then defensively copies it again

- java.util.ArrayList(Collection) -- constructor has a special check
`if (collection.getClass() == ArrayList.class)`, in which case it
skips the defensive copy and trusts the result of toArray(), but it
does not trust any other collection

- java.util.Vector(Collection) -- constructor has the same
optimization trick as ArrayList, where it trusts ArrayList, but
creates a defensive copy for any other collection. Vector doesn't even
trust itself!

- java.util.PriorityQueue.initElementsFromCollection(Collection) --
has optimization for ArrayList, but no other collection

- java.util.concurrent.PriorityBlockingQueue(Collection) -- has
optimization for ArrayList, but no other collection

- java.util.concurrent.CopyOnWriteArrayList(Collection) -- has
optimizations for ArrayList and CopyOnWriteArrayList, but no other
collection.

- java.util.concurrent.CopyOnWriteArrayList.addAllAbsent(Collection)
-- has optimization for ArrayList but no other collection.

- java.util.stream.Collectors.toUnmodifiableList() -- this avoids an
excess copy, but does so in a very complicated way

- sun.awt.util.IdentityArrayList(Collection) -- this appears to be an
old copy-paste of ArrayList except with identity indexOf; like
pre-2020 ArrayList, it blindly trusts that `col.toArray()` returns a
new array, although based on this class's extremely limited use there
is no manifestable bug.

- java.lang.invoke.MethodHandles.dropArguments/dropArgumentsToMatch --
these call `list.toArray()` and then clone the array unconditionally

- java.time.zone.ZoneRules.of(...) -- calls `lastRules.toArray()` and
then clones the array unconditionally

-----

I would suggest that there should be an internal utility method for
use in all these cases, something like:

    public static Object[] safeToArray(Collection<?> c) {
        Object[] a = c.toArray();
        if (!isTrusted(c)) {
            a = Arrays.copyOf(a, a.length, Object[].class);
        }
        return a;
    }

The delicate issue then is: what is the correct design of "isTrusted"?
The check must be fast, and not spoofable by classes outside the JDK.

My first thought was to detect that the class is loaded by the
bootstrap class loader:

    static boolean isTrusted(Collection<?> c) {
        return c.getClass().getClassLoader() == null;
    }

However, I now realize this is unwise because it would shift the
burden of trust into **every** Collection class in the JDK, instead of
into the classes that actually care about the result. For example,
Collections.UnmodifiableCollection.toArray() simply returns the result
of its internal `col.toArray()`, which immediately carves a hole the
size of a bus thru the check.

Perhaps better would be for classes to explicitly opt-in and say "Yes,
I promise that my toArray() method is safe". Would a package-private
marker interface work for this?

I am not sure.

Anyway, the point is that classes within the JDK should not need to
create so many defensive copies of each other.

Note that this is becoming more of an issue now that List.of is
becoming a popular "go-to" list for many situations where ArrayList
would traditionally have been used. The "only trust ArrayList" pattern
was never ideal and is starting to age.

This is my first post here so I apologize for oversights in formatting
or etiquette.
Thanks for reading.

Reply via email to