[ https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152153#comment-17152153 ]
Stefan Bodewig commented on COMPRESS-539:
-----------------------------------------
I'm afraid we've never run any measurements but relied on gut feeling.
The assumption here is that {{skip}} on a seekable stream should be faster
when skipping more than just a few bytes. I'm not sure how you constructed the
example [~rschimpf], but I guess the entries are pretty small (a few kilobytes).
Our (well, my) expectation is that {{read}} in such a case might return cached
data, as we've just read a tar header from the stream, but if the entry in
question were bigger (a few megabytes, maybe), reading it should be
measurably slower than seeking ahead by the same number of bytes.
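
A rough way to check this expectation would be to time both approaches on a
file of a few megabytes. The sketch below is only an illustration, not a proper
benchmark (no warm-up, and the file is most likely still in the OS page cache
right after it is written). The class and method names are made up for this
example, and it assumes a FileInputStream, whose skip seeks instead of reading.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Commons Compress.
public class SkipVersusRead {

    /** Advances the stream by up to n bytes via skip(); returns the bytes actually skipped. */
    public static long advanceBySkip(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                break; // skip made no progress, e.g. at end of stream
            }
            remaining -= skipped;
        }
        return n - remaining;
    }

    /** Advances the stream by up to n bytes via read() into a caller-supplied buffer. */
    public static long advanceByRead(InputStream in, long n, byte[] buffer) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (read < 0) {
                break; // end of stream
            }
            remaining -= read;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        int size = 8 * 1024 * 1024; // assumed size of a "bigger" entry
        Path file = Files.createTempFile("skip-vs-read", ".bin");
        try {
            Files.write(file, new byte[size]);

            long start = System.nanoTime();
            try (InputStream in = new FileInputStream(file.toFile())) {
                advanceBySkip(in, size);
            }
            long skipNanos = System.nanoTime() - start;

            start = System.nanoTime();
            try (InputStream in = new FileInputStream(file.toFile())) {
                advanceByRead(in, size, new byte[8192]);
            }
            long readNanos = System.nanoTime() - start;

            System.out.printf("skip: %d us, read: %d us%n",
                    skipNanos / 1000, readNanos / 1000);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Running it with several values of size (a few kilobytes versus a few megabytes)
would show whether the gut feeling holds for a given stream implementation.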
> TarArchiveInputStream allocates a lot of memory when iterating through an
> archive
> ---------------------------------------------------------------------------------
>
> Key: COMPRESS-539
> URL: https://issues.apache.org/jira/browse/COMPRESS-539
> Project: Commons Compress
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: Robin Schimpf
> Assignee: Peter Lee
> Priority: Major
> Attachments: Don't_call_InputStream#skip.patch,
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png,
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png,
> image-2020-07-05-22-10-07-402.png, image-2020-07-05-22-11-25-526.png,
> image-2020-07-05-22-32-15-131.png, image-2020-07-05-22-32-31-511.png
>
>
> I iterated through the linux source tar and noticed some unneeded
> allocations happen without extracting any data.
> Code to reproduce:
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
> try (TarArchiveInputStream in =
>         new TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) {
>     TarArchiveEntry entry;
>     // Iterate over all entries without reading any entry data.
>     while ((entry = in.getNextTarEntry()) != null) {
>     }
> }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder.
> Options used:
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
> Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
> This delegates in my test scenario to the InputStream.skip implementation
> which allocates a new byte[] for every invocation. By simply commenting out
> the while loop which calls the skip method the estimated TLAB allocation
> drops to 164MiB (-129MiB).
> !image-2020-06-21-10-58-43-255.png!
> Commenting out the skip call is probably not the best solution, but it
> was a quick way for me to see how much memory can be saved. Also, no unit
> tests were failing for me.
> 2. TarArchiveInputStream.readRecord
> A new byte[] is created for every record read. Since the record size
> does not change, the byte[] can be created once when the stream is
> instantiated and then reused. This optimization is already present in the
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB
> allocations further to 128MiB (-36MiB).
> !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.
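
The two changes described in the quoted report (skipping by reading into a
buffer that is allocated once, and reusing a single record buffer) can be
sketched roughly as follows. This is not the attached patch: the class
ReusedBufferReader, its method names, and the 4096-byte skip buffer are
hypothetical, and the record size is assumed to be the usual 512 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration, not the Commons Compress implementation.
public class ReusedBufferReader {
    private static final int RECORD_SIZE = 512; // assumed fixed record size

    private final InputStream in;
    // Both buffers are allocated once per stream instead of once per call.
    private final byte[] recordBuffer = new byte[RECORD_SIZE];
    private final byte[] skipBuffer = new byte[4096];

    public ReusedBufferReader(InputStream in) {
        this.in = in;
    }

    /**
     * Reads one full record into the shared buffer.
     * Returns the shared buffer, or null if the stream ends before a full record.
     * Callers must copy the data if they need it after the next call.
     */
    public byte[] readRecord() throws IOException {
        int offset = 0;
        while (offset < RECORD_SIZE) {
            int read = in.read(recordBuffer, offset, RECORD_SIZE - offset);
            if (read < 0) {
                return null;
            }
            offset += read;
        }
        return recordBuffer;
    }

    /** Skips up to n bytes by reading into the shared skip buffer; returns bytes skipped. */
    public long skipByReading(long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            int read = in.read(skipBuffer, 0, (int) Math.min(skipBuffer.length, remaining));
            if (read < 0) {
                break; // end of stream
            }
            remaining -= read;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        // Four empty records stand in for a header, two data records, and another header.
        ReusedBufferReader reader =
                new ReusedBufferReader(new ByteArrayInputStream(new byte[4 * RECORD_SIZE]));
        byte[] first = reader.readRecord();
        long skipped = reader.skipByReading(2 * RECORD_SIZE);
        byte[] second = reader.readRecord();
        System.out.println("skipped " + skipped + " bytes, same buffer reused: "
                + (first == second));
    }
}
```

The trade-off of sharing the record buffer is that a returned record is only
valid until the next read, so any caller that keeps the data has to copy it.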