Re: Excessive copying of Collection.toArray()

Shaun Spiller Fri, 05 Sep 2025 09:21:14 -0700

On Thu, 4 Sept 2025 at 03:17 UTC, Stuart Marks <stuart.ma...@oracle.com> wrote:
>
> Hi Shaun,
>
> You got pretty far with your analysis; congratulations!


Thank you!!

> You already proposed and knocked down the "obvious solution" which is to 
> trust JDK
> classes (which are loaded by the boot classloader). Similar simple solutions, 
> such
> as an allowlist with `instanceof` checks, can be defeated by subclassing.

Those two checks `&&`ed together would work, though, I think?

Here's a rough census of some ways to identify "trusted" collection classes:

* ClassLoader -- not selective enough on its own

* instanceof ClassName, combined with classloader check -- has the
hazard of being inheritable within the JDK; won't work for private
collection classes; O(n) where n is the number of classes checked

* instanceof MarkerInterface, combined with classloader check -- has
the hazard of being inheritable within the JDK; presence of marker
interface would be visible outside the JDK via reflection, which is
inelegant but mostly harmless

* .getClass().getInterfaces().contains(MarkerInterface) --
non-heritable version of instanceof

* A centrally-maintained table of trusted Class objects --
non-heritable; won't work for private collection classes; causes early
class loading

* A centrally-maintained table of trusted class *names* -- avoids
premature class loading; a brittle binding because it's not checked at
compile time

* Marker annotation on class -- could be very tidy but performance is
unclear and might have bootstrapping problems??

* Each participating class uses a static initializer to call a JDK
internal method to add its Class object to a dynamically populated
table of trusted classes -- extra complexity of dynamic table but I
believe it could work

* more???

It's something that would need checking by a security expert.

Any of the table-based solutions have the potential to become quite
large. If any JDK Collection is potentially allowed to join the
"trusted" club, that's a huge type hierarchy. So there is the question
of how to achieve O(1) lookup. I would instinctively reach for an
IdentityHashMap but maybe that's not bootstrap-friendly, I don't know.
Or maybe the trusted club really should just be very small and
insular. Trading security for performance is short-sighted.

Also, I see no options for anonymous collection classes to
participate, except stop being anonymous. Anonymous classes don't have
proper names, can't implement extra interfaces, can't have class
annotations, can't refer to their class object in a static
initializer, etc. There are very few of these, although
java.util.AbstractMap.keySet() is an example.

> But this would still leave non-JDK classes out in the cold. An arrangement 
> that
> could work would be to have some construct that allows the creation of and 
> writing
> to an array, but which precludes the possibility of retaining a reference to 
> the
> array (or perhaps merely preventing writes to it after a certain point). One 
> could
> imagine a library construct that implements this. Of course it would require 
> classes
> to opt into this arrangement. Ordinary implementors of toArray() would still 
> suffer
> an extra copy.

It's remarkable, because at its core it's such a basic thing for a
computer program to do: copy a list and stick it in another. The bugs
that result are highly non-obvious. When the need is basic but the
correct solution is subtle, it's the kind of bug that will keep
happening. Not just with ArrayList, but with anything conceptually
similar. It's like Java's equivalent of a buffer overflow. The
language design makes it inevitable for this to go wrong.

I think the simple, widely-applicable List.of and friends are very
helpful here because they make it easy for applications to move data
around securely and immutably, to do things the right way even if they
didn't realize there was a wrong way. If there were also a solution
for primitives -- frozen arrays or primitive generic lists -- that
could cover some more ground.

Around 2010, I wrote a terrible hack for the usual wrong reason
(performance): there was a ByteArrayOutputStream with a large blob of
data, and I wanted to avoid the clone cost of calling toByteArray(),
so I did `baos.writeTo(new OutputStream() { ... })` with an overridden
`OutputStream.write` that simply exfiltrated the passed buffer
argument -- letting me access, even modify, the internal buffer of the
ByteArrayOutputStream.

My intent wasn't nefarious. But the general principle is this: arrays
are dangerous. They're always mutable, widely shared around, and you
can't stop someone retaining a reference, eventually causing problems
deliberately or accidentally.

I had a vision similar to what you describe: Some kind of library
class that takes an array and provides a temporary, closeable view of
a slice of it, so access can be shared (read-only, write-only, or
read-write) inside an enforceable scope. But I think such a class
would be too cumbersome to see wide use unless its views could be
typecast to a "normal" array type. Then it could be used with
traditional APIs like `Collection.toArray(Object[])`, or
`OutputStream.write(byte[])`. But that turns a simple utility class
into a serious language and VM engineering project.

> But maybe this is overkill, and the best is the enemy of the good. Perhaps we 
> could
> add a short allowlist that contains just the most frequently-used JDK 
> classes. This
> wouldn't be exhaustive, but it could help the situation by covering more of 
> the
> common cases.

Yes, 100%.

The Pareto principle applies; most of the available benefit would come
from just a few cases.

It would be nice if at least ArrayList and List.of could trust each
other, providing a smooth transition between the mutable and immutable
realms.

> > Also, the knowledge of why the defensive clone is made at all, and why
> > it must be of type Object[] and not a covariant subclass, these are
> > currently arcane secrets held only by those who know the history of
> > the associated bug reports. That's another part of why I argue for a
> > central method that can be named and discussed.

> The need for an extra defensive copy was established and fixed via the 
> security
> vulnerability process and as such, all discussion of it was embargoed. Sorry.
> [...]
> But I think you've discerned the issue yourself already.

In 2014 I sent in a bug report, JDK-8048200, the "Widget" example
showing some toArray() mischief.

If I recall correctly, it was closed as a duplicate of some earlier
report or reports, ultimately disregarded because the exploit did not
seem potent enough to justify the cost of blocking it.

Years later I noticed the change to ArrayList, and discovered that my
bug report and dupes had been quietly disappeared from the web. I
can't even see the title. So I realized there had been some kind of
blowup. :)

Anyway, what I was wanting to say by complaining about the "arcane
secrets" is simply it's another reason why a documented method is
beneficial, to leave signposts in the code when dragons are nearby.
Another 50 years from now, those classes will still be in use. If
someone in the future reworking PriorityBlockingQueue sees this:

        if (c.getClass() != java.util.ArrayList.class)
            es = Arrays.copyOf(es, n, Object[].class);

they won't appreciate the detailed thought behind those two lines
unless they decide to do some extensive digital archeology. A
commented method can refer to past bug reports, ward off future bugs,
and help promote the right way of doing things in general. I've read
your blog post on the subject of "The Importance of Writing Stuff
Down" :)

I imagine it was initially undesirable to draw attention to the code
in an open-source security patch, but it's been 5 years.

~

Re: Excessive copying of Collection.toArray()

Reply via email to