[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Peter Lee (Jira) Sat, 04 Jul 2020 00:47:06 -0700


    [ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151216#comment-17151216
 ]


Peter Lee commented on COMPRESS-539:
------------------------------------

IIRC the IOUtils.skip was originally copied from Apache Commons IO(not a maven 
dependency to keep Compress as simple as we want it).

I just checked the IOUtils in Apache Commons IO and found out the skip method 
looks like this:
{code:java}
/**
 * Skips characters from an input character stream.
 * This implementation guarantees that it will read as many characters
 * as possible before giving up; this may not always be the case for
 * skip() implementations in subclasses of {@link Reader}.
 * <p>
 * Note that the implementation uses {@link Reader#read(char[], int, int)} 
rather
 * than delegating to {@link Reader#skip(long)}.
 * This means that the method may be considerably less efficient than using the 
actual skip implementation,
 * this is done to guarantee that the correct number of characters are skipped.
 * </p>
 *
 * @param input character stream to skip
 * @param toSkip number of characters to skip.
 * @return number of characters actually skipped.
 * @throws IOException              if there is a problem reading the file
 * @throws IllegalArgumentException if toSkip is negative
 * @see Reader#skip(long)
 * @see <a href="https://issues.apache.org/jira/browse/IO-203";>IO-203 - Add 
skipFully() method for InputStreams</a>
 * @since 2.0
 */
public static long skip(final Reader input, final long toSkip) throws 
IOException {
    if (toSkip < 0) {
        throw new IllegalArgumentException("Skip count must be non-negative, 
actual: " + toSkip);
    }
    /*
     * N.B. no need to synchronize this because: - we don't care if the buffer 
is created multiple times (the data
     * is ignored) - we always use the same size buffer, so if it it is 
recreated it will still be OK (if the buffer
     * size were variable, we would need to synch. to ensure some other thread 
did not create a smaller one)
     */
    if (SKIP_CHAR_BUFFER == null) {
        SKIP_CHAR_BUFFER = new char[SKIP_BUFFER_SIZE];
    }
    long remain = toSkip;
    while (remain > 0) {
        // See https://issues.apache.org/jira/browse/IO-203 for why we use 
read() rather than delegating to skip()
        final long n = input.read(SKIP_CHAR_BUFFER, 0, (int) Math.min(remain, 
SKIP_BUFFER_SIZE));
        if (n < 0) { // EOF
            break;
        }
        remain -= n;
    }
    return toSkip - remain;
}
{code}
Not calling inputstream.skip as you did in your patch. I believe this 
implemention would get the same memory benefit as your patch did.

BTW seems the IOUtils methods in Commons Compress are somehow out of date - 
maybe we can update them using reference from Commons IO.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> ---------------------------------------------------------------------------------
>
>                 Key: COMPRESS-539
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-539
>             Project: Commons Compress
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Robin Schimpf
>            Priority: Major
>         Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
>     try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) {
>         TarArchiveEntry entry;
>         while ((entry = in.getNextTarEntry()) != null) {
>         }
>     }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Reply via email to