Re: [PR] feat: Add incremental `toByteArray` method [commons-io]

via GitHub Mon, 08 Sep 2025 03:06:56 -0700


vy commented on code in PR #776:
URL: https://github.com/apache/commons-io/pull/776#discussion_r2329688265



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -221,6 +221,21 @@ public class IOUtils {
      */
     private static final char[] SCRATCH_CHAR_BUFFER_WO = charArray();
 
+    /**
+     * The maximum size of an array in many Java VMs.
+     */
+    private static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;
+
+    /*
+     * Default maximum chunk size used when copying large streams into a byte 
array.
+     * <p>
+     * This value is somewhat arbitrary, currently aligned with the value used 
by
+     * <a 
href="https://github.com/python/cpython/blob/3.14/Lib/_pyio.py";>Python</a>
+     * for copying streams.
+     * </p>
+     */
+    private static final int DEFAULT_CHUNK_SIZE = 128 * 1024;

Review Comment:
   Food for thought: `Files.BUFFER_SIZE` is 8 KiB – maybe we shall stick to 
Java defaults? Note that 8 KiB will theoretically allow more threads consuming 
in parallel.



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -221,6 +221,21 @@ public class IOUtils {
      */
     private static final char[] SCRATCH_CHAR_BUFFER_WO = charArray();
 
+    /**
+     * The maximum size of an array in many Java VMs.
+     */
+    private static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

Review Comment:
   A reference to `ArraysSupport#SOFT_MAX_ARRAY_LENGTH` in the Javadoc would be 
handy for maintainers.



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -2714,13 +2800,9 @@ static byte[] toByteArray(final IOTriFunction<byte[], 
Integer, Integer, Integer>
             return EMPTY_BYTE_ARRAY;
         }
         final byte[] data = byteArray(size);
-        int offset = 0;
-        int read;
-        while (offset < size && (read = input.apply(data, offset, size - 
offset)) != EOF) {
-            offset += read;
-        }
-        if (offset != size) {
-            throw new IOException("Unexpected read size, current: " + offset + 
", expected: " + size);
+        final int read = read(input, data, 0, size);

Review Comment:
   AFAICT, this _improvement_ feels unrelated with this PR. You might consider 
carrying this out in a separate PR.



##########
src/test/java/org/apache/commons/io/IOUtilsTest.java:
##########


Review Comment:
   I don't see a test that verifies the exception on insufficient data. Am I 
mistaken?



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -2637,57 +2652,65 @@ public static BufferedReader toBufferedReader(final 
Reader reader, final int siz
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}.
-     * <p>
-     * This method buffers the input internally, so there is no need to use a 
{@link BufferedInputStream}.
-     * </p>
+     * Reads all the bytes from an input stream in a byte array.
      *
-     * @param inputStream the {@link InputStream} to read.
-     * @return the requested byte array.
-     * @throws NullPointerException if the InputStream is {@code null}.
-     * @throws IOException          if an I/O error occurs or reading more 
than {@link Integer#MAX_VALUE} occurs.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number
+     * of bytes read, which is only limited by {@link Integer#MAX_VALUE}. This 
makes it unsuitable for
+     * processing large input streams, unless sufficient heap space is 
available.</p>
+     *
+     * @param inputStream The {@link InputStream} to read; must not be {@code 
null}.
+     * @return A new byte array.
+     * @throws IllegalArgumentException If the size of the stream is greater 
than the maximum array size.
+     * @throws IOException              If an I/O error occurs while reading.
+     * @throws NullPointerException     If {@code inputStream} is {@code null}.
      */
     public static byte[] toByteArray(final InputStream inputStream) throws 
IOException {
-        // We use a ThresholdingOutputStream to avoid reading AND writing more 
than Integer.MAX_VALUE.
-        try (UnsynchronizedByteArrayOutputStream ubaOutput = 
UnsynchronizedByteArrayOutputStream.builder().get();
-                ThresholdingOutputStream thresholdOutput = new 
ThresholdingOutputStream(Integer.MAX_VALUE, os -> {
-                    throw new IllegalArgumentException(String.format("Cannot 
read more than %,d into a byte array", Integer.MAX_VALUE));
-                }, os -> ubaOutput)) {
-            copy(inputStream, thresholdOutput);
-            return ubaOutput.toByteArray();
+        final UnsynchronizedByteArrayOutputStream output =
+                copyToOutputStream(inputStream, MAX_ARRAY_LENGTH + 1, 
DEFAULT_CHUNK_SIZE);
+        if (output.size() > MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException(
+                    String.format("Cannot read more than %,d into a byte 
array", MAX_ARRAY_LENGTH));
         }
+        return output.toByteArray();
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}. Use 
this method instead of
-     * {@link #toByteArray(InputStream)} when {@link InputStream} size is 
known.
+     * Reads exactly {@code size} bytes from the given {@link InputStream} 
into a new {@code byte[]}.
      *
-     * @param input the {@link InputStream} to read.
-     * @param size the size of {@link InputStream} to read, where 0 &lt; 
{@code size} &lt;= length of input stream.
-     * @return byte [] of length {@code size}.
-     * @throws IOException if an I/O error occurs or {@link InputStream} 
length is smaller than parameter {@code size}.
-     * @throws IllegalArgumentException if {@code size} is less than zero.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number

Review Comment:
   Okay... I see we're using `UnsynchronizedByteArrayOutputStream`, which 
allocates along the way. Now I understand why you avoided to provide an 
elaborate explanation on the memory cap. I was about to suggest you to shape 
the Javadoc similar to `InputStream::readNBytes`, but then I realized that you 
already did by using the `proportional` term here. I see you've done your 
homework. :+1:



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -2637,57 +2652,65 @@ public static BufferedReader toBufferedReader(final 
Reader reader, final int siz
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}.
-     * <p>
-     * This method buffers the input internally, so there is no need to use a 
{@link BufferedInputStream}.
-     * </p>
+     * Reads all the bytes from an input stream in a byte array.
      *
-     * @param inputStream the {@link InputStream} to read.
-     * @return the requested byte array.
-     * @throws NullPointerException if the InputStream is {@code null}.
-     * @throws IOException          if an I/O error occurs or reading more 
than {@link Integer#MAX_VALUE} occurs.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number
+     * of bytes read, which is only limited by {@link Integer#MAX_VALUE}. This 
makes it unsuitable for
+     * processing large input streams, unless sufficient heap space is 
available.</p>
+     *
+     * @param inputStream The {@link InputStream} to read; must not be {@code 
null}.
+     * @return A new byte array.
+     * @throws IllegalArgumentException If the size of the stream is greater 
than the maximum array size.
+     * @throws IOException              If an I/O error occurs while reading.
+     * @throws NullPointerException     If {@code inputStream} is {@code null}.
      */
     public static byte[] toByteArray(final InputStream inputStream) throws 
IOException {
-        // We use a ThresholdingOutputStream to avoid reading AND writing more 
than Integer.MAX_VALUE.
-        try (UnsynchronizedByteArrayOutputStream ubaOutput = 
UnsynchronizedByteArrayOutputStream.builder().get();
-                ThresholdingOutputStream thresholdOutput = new 
ThresholdingOutputStream(Integer.MAX_VALUE, os -> {
-                    throw new IllegalArgumentException(String.format("Cannot 
read more than %,d into a byte array", Integer.MAX_VALUE));
-                }, os -> ubaOutput)) {
-            copy(inputStream, thresholdOutput);
-            return ubaOutput.toByteArray();
+        final UnsynchronizedByteArrayOutputStream output =
+                copyToOutputStream(inputStream, MAX_ARRAY_LENGTH + 1, 
DEFAULT_CHUNK_SIZE);
+        if (output.size() > MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException(
+                    String.format("Cannot read more than %,d into a byte 
array", MAX_ARRAY_LENGTH));
         }
+        return output.toByteArray();
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}. Use 
this method instead of
-     * {@link #toByteArray(InputStream)} when {@link InputStream} size is 
known.
+     * Reads exactly {@code size} bytes from the given {@link InputStream} 
into a new {@code byte[]}.
      *
-     * @param input the {@link InputStream} to read.
-     * @param size the size of {@link InputStream} to read, where 0 &lt; 
{@code size} &lt;= length of input stream.
-     * @return byte [] of length {@code size}.
-     * @throws IOException if an I/O error occurs or {@link InputStream} 
length is smaller than parameter {@code size}.
-     * @throws IllegalArgumentException if {@code size} is less than zero.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number
+     * of bytes read and limited by the specified {@code size}. This makes it 
suitable for
+     * processing large input streams, provided that 
<strong>sufficient</strong> heap space is
+     * available.</p>
+     *
+     * @param input the {@link InputStream} to read; must not be {@code null}.
+     * @param size  the exact number of bytes to read; must be {@code >= 0}.
+     * @return a new byte array of length {@code size}.
+     * @throws IllegalArgumentException if {@code size} is negative.
+     * @throws EOFException             if the stream ends before {@code size} 
bytes are read.
+     * @throws IOException              if an I/O error occurs while reading.
+     * @throws NullPointerException     if {@code input} is {@code null}.
      * @since 2.1
      */
-    @SuppressWarnings("resource")
     public static byte[] toByteArray(final InputStream input, final int size) 
throws IOException {
-        return toByteArray(Objects.requireNonNull(input, "input")::read, size);
+        return toByteArray(input, size, DEFAULT_CHUNK_SIZE);

Review Comment:
   I don't think this necessarily implies double buffering: we can avoid the 
ephemeral buffer if `size <= chunkSize` – and this is what @ppkarwasz already 
does:
   
   
https://github.com/apache/commons-io/pull/776/files#diff-b106f51e7daf5b40ed10dccc4622f425f3e82afc52fb736547a8f28efa6478e4R2750-R2753



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -2637,57 +2652,65 @@ public static BufferedReader toBufferedReader(final 
Reader reader, final int siz
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}.
-     * <p>
-     * This method buffers the input internally, so there is no need to use a 
{@link BufferedInputStream}.
-     * </p>
+     * Reads all the bytes from an input stream in a byte array.
      *
-     * @param inputStream the {@link InputStream} to read.
-     * @return the requested byte array.
-     * @throws NullPointerException if the InputStream is {@code null}.
-     * @throws IOException          if an I/O error occurs or reading more 
than {@link Integer#MAX_VALUE} occurs.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number
+     * of bytes read, which is only limited by {@link Integer#MAX_VALUE}. This 
makes it unsuitable for
+     * processing large input streams, unless sufficient heap space is 
available.</p>
+     *
+     * @param inputStream The {@link InputStream} to read; must not be {@code 
null}.
+     * @return A new byte array.
+     * @throws IllegalArgumentException If the size of the stream is greater 
than the maximum array size.
+     * @throws IOException              If an I/O error occurs while reading.
+     * @throws NullPointerException     If {@code inputStream} is {@code null}.
      */
     public static byte[] toByteArray(final InputStream inputStream) throws 
IOException {
-        // We use a ThresholdingOutputStream to avoid reading AND writing more 
than Integer.MAX_VALUE.
-        try (UnsynchronizedByteArrayOutputStream ubaOutput = 
UnsynchronizedByteArrayOutputStream.builder().get();
-                ThresholdingOutputStream thresholdOutput = new 
ThresholdingOutputStream(Integer.MAX_VALUE, os -> {
-                    throw new IllegalArgumentException(String.format("Cannot 
read more than %,d into a byte array", Integer.MAX_VALUE));
-                }, os -> ubaOutput)) {
-            copy(inputStream, thresholdOutput);
-            return ubaOutput.toByteArray();
+        final UnsynchronizedByteArrayOutputStream output =
+                copyToOutputStream(inputStream, MAX_ARRAY_LENGTH + 1, 
DEFAULT_CHUNK_SIZE);
+        if (output.size() > MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException(
+                    String.format("Cannot read more than %,d into a byte 
array", MAX_ARRAY_LENGTH));
         }
+        return output.toByteArray();
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}. Use 
this method instead of
-     * {@link #toByteArray(InputStream)} when {@link InputStream} size is 
known.
+     * Reads exactly {@code size} bytes from the given {@link InputStream} 
into a new {@code byte[]}.
      *
-     * @param input the {@link InputStream} to read.
-     * @param size the size of {@link InputStream} to read, where 0 &lt; 
{@code size} &lt;= length of input stream.
-     * @return byte [] of length {@code size}.
-     * @throws IOException if an I/O error occurs or {@link InputStream} 
length is smaller than parameter {@code size}.
-     * @throws IllegalArgumentException if {@code size} is less than zero.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number

Review Comment:
   Can you explain what is meant with `proportional` here? As a user reading 
this to understand the API, I find it more confusing than helpful. Can't we 
simply say something as follows?
   
   > This method allocates at most `size` bytes, plus {@value MAX_ARRAY_LENGTH} 
bytes for the temporary buffer used to consume the stream.



##########
src/main/java/org/apache/commons/io/IOUtils.java:
##########
@@ -2637,57 +2652,65 @@ public static BufferedReader toBufferedReader(final 
Reader reader, final int siz
     }
 
     /**
-     * Gets the contents of an {@link InputStream} as a {@code byte[]}.
-     * <p>
-     * This method buffers the input internally, so there is no need to use a 
{@link BufferedInputStream}.
-     * </p>
+     * Reads all the bytes from an input stream in a byte array.
      *
-     * @param inputStream the {@link InputStream} to read.
-     * @return the requested byte array.
-     * @throws NullPointerException if the InputStream is {@code null}.
-     * @throws IOException          if an I/O error occurs or reading more 
than {@link Integer#MAX_VALUE} occurs.
+     * <p>The memory used by this method is <strong>proportional</strong> to 
the number
+     * of bytes read, which is only limited by {@link Integer#MAX_VALUE}. This 
makes it unsuitable for
+     * processing large input streams, unless sufficient heap space is 
available.</p>
+     *
+     * @param inputStream The {@link InputStream} to read; must not be {@code 
null}.
+     * @return A new byte array.
+     * @throws IllegalArgumentException If the size of the stream is greater 
than the maximum array size.
+     * @throws IOException              If an I/O error occurs while reading.
+     * @throws NullPointerException     If {@code inputStream} is {@code null}.
      */
     public static byte[] toByteArray(final InputStream inputStream) throws 
IOException {
-        // We use a ThresholdingOutputStream to avoid reading AND writing more 
than Integer.MAX_VALUE.
-        try (UnsynchronizedByteArrayOutputStream ubaOutput = 
UnsynchronizedByteArrayOutputStream.builder().get();
-                ThresholdingOutputStream thresholdOutput = new 
ThresholdingOutputStream(Integer.MAX_VALUE, os -> {
-                    throw new IllegalArgumentException(String.format("Cannot 
read more than %,d into a byte array", Integer.MAX_VALUE));
-                }, os -> ubaOutput)) {
-            copy(inputStream, thresholdOutput);
-            return ubaOutput.toByteArray();
+        final UnsynchronizedByteArrayOutputStream output =
+                copyToOutputStream(inputStream, MAX_ARRAY_LENGTH + 1, 
DEFAULT_CHUNK_SIZE);
+        if (output.size() > MAX_ARRAY_LENGTH) {

Review Comment:
   Can this ever happen? If not, I'd replace `if` with `assert`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] feat: Add incremental `toByteArray` method [commons-io]

Reply via email to