[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Stefan Bodewig (Jira) Sat, 04 Jul 2020 04:14:06 -0700


    [ 
https://issues.apache.org/jira/browse/COMPRESS-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151283#comment-17151283
 ]


Stefan Bodewig commented on COMPRESS-539:
-----------------------------------------

not calling skip may reduce the amount of memory consumed but will slow things 
down considerably as skip is usually far more efficient than read. In your case 
I'd recommend wrapping your input stream into a {{SkipShieldingInputStream}} 
instead, which will have the same effect as commenting out the skip calls.

Re-using the buffer in {{TarArchiveInputStream}} sounds right. No idea why this 
is different from TarArchiveOutputStream where it seems we've used a shared 
buffer since forever.

[~peterlee] the team has had this discussion around COMPRESS-443 - a while 
before you joined - and back then explicitly decided we wanted to keep using 
{{skip}} for performance reasons.

> TarArchiveInputStream allocates a lot of memory when iterating through an 
> archive
> ---------------------------------------------------------------------------------
>
>                 Key: COMPRESS-539
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-539
>             Project: Commons Compress
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Robin Schimpf
>            Assignee: Peter Lee
>            Priority: Major
>         Attachments: Don't_call_InputStream#skip.patch, 
> Reuse_recordBuffer.patch, image-2020-06-21-10-58-07-917.png, 
> image-2020-06-21-10-58-43-255.png, image-2020-06-21-10-59-10-825.png
>
>
>  I iterated through the linux source tar and noticed some unneeded 
> allocations happen without extracting any data.
> Reproducing code
> {code:java}
> File tarFile = new File("linux-5.7.1.tar");
>     try (TarArchiveInputStream in = new 
> TarArchiveInputStream(Files.newInputStream(tarFile.toPath()))) {
>         TarArchiveEntry entry;
>         while ((entry = in.getNextTarEntry()) != null) {
>         }
>     }
> {code}
> The measurement was done on Java 11.0.7 with the Java Flight Recorder. 
> Options used: 
> -XX:StartFlightRecording=settings=profile,filename=allocations.jfr
> Baseline with the current master implementation:
>  Estimated TLAB allocation: 293MiB
> !image-2020-06-21-10-58-07-917.png!
> 1. IOUtils.skip -> input.skip(numToSkip)
>  This delegates in my test scenario to the InputStream.skip implementation 
> which allocates a new byte[] for every invocation. By simply commenting out 
> the while loop which calls the skip method the estimated TLAB allocation 
> drops to 164MiB (-129MiB).
>  !image-2020-06-21-10-58-43-255.png! 
>  Commenting out the skip call does not seem to be the best solution but it 
> was quick for me to see how much memory can be saved. Also no unit tests 
> where failing for me.
> 2. TarArchiveInputStream.readRecord
>  For every read of the record a new byte[] is created. Since the record size 
> does not change the byte[] can be reused and created when instantiating the 
> TarStream. This optimization is already present in the 
> TarArchiveOutputStream. Reusing the buffer reduces the estimated TLAB 
> allocations further to 128MiB (-36MiB).
>  !image-2020-06-21-10-59-10-825.png!
> I attached the patches I used so the results can be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COMPRESS-539) TarArchiveInputStream allocates a lot of memory when iterating through an archive

Reply via email to