[
https://issues.apache.org/jira/browse/COMPRESS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821879#comment-17821879
]
Gary D. Gregory commented on COMPRESS-666:
------------------------------------------
Hello [~cosmin79]
What happens if you change the {{TarArchiveInputStream}} constructor to use the
smallest buffers:
{code:java}
{code:java}
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(
        new BufferedInputStream(new GZIPInputStream(inputStream)),
        TarConstants.DEFAULT_RCDSIZE, TarConstants.DEFAULT_RCDSIZE);
{code}
?
Note that our constants are misnamed. If you look at the tar.h file, you'll see
a constant called {{BLOCKSIZE}} defined as {{512}}, but in our code we call
this {{DEFAULT_RCDSIZE}}, and we define {{DEFAULT_BLKSIZE}} as
{{DEFAULT_RCDSIZE * 20}}. See {{TarConstants}}.
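For reference, the relationship described above, with the values mirrored here as plain constants (not imported from the library) so the naming mismatch with tar.h is explicit:
{code:java}
public class TarSizes {
    // tar.h calls 512 BLOCKSIZE; Commons Compress calls the same value DEFAULT_RCDSIZE.
    static final int DEFAULT_RCDSIZE = 512;
    // Commons Compress's DEFAULT_BLKSIZE is 20 records, i.e. 10240 bytes.
    static final int DEFAULT_BLKSIZE = DEFAULT_RCDSIZE * 20;

    public static void main(String[] args) {
        System.out.println(DEFAULT_RCDSIZE); // 512
        System.out.println(DEFAULT_BLKSIZE); // 10240
    }
}
{code}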
What might be happening is that this test surfaces an issue that's been there
all along and that we are only now seeing, either due to the naming confusion
above or perhaps something like this: under more load, the underlying input
stream serves fewer (or more) bytes per read, and the issue only happens when
input is "chunked" into unlucky groups. Recall that when you ask an input
stream for a given number of bytes, you are not guaranteed to get them all in
one call. All this code depends heavily on reading groups of bytes of a
certain size, skipping some regions, and then expecting data to be laid out a
specific way.
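The short-read behavior above is easy to reproduce with a plain {{InputStream}}: a single {{read(byte[], int, int)}} call may legally return fewer bytes than requested, so anything parsing fixed-size records has to loop. A minimal sketch (the {{readFully}} helper and the chunk-limiting stream are hypothetical, not Commons Compress code):
{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    /** Keeps reading until the buffer is full or EOF; returns bytes actually read. */
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // EOF before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A stream that serves at most 3 bytes per read() call, simulating
        // "unlucky chunking" under load.
        InputStream chunky = new ByteArrayInputStream(new byte[512]) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 3));
            }
        };
        byte[] record = new byte[512];
        int got = readFully(chunky, record);
        System.out.println("read " + got + " bytes"); // read 512 bytes
    }
}
{code}
A single {{chunky.read(record, 0, 512)}} would return only 3 here; code that treats that return value as "the whole record" would then parse garbage at the wrong offsets.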
> Multithreaded access to Tar archive throws java.util.zip.ZipException: Corrupt GZIP trailer
> -------------------------------------------------------------------------------------------
>
> Key: COMPRESS-666
> URL: https://issues.apache.org/jira/browse/COMPRESS-666
> Project: Commons Compress
> Issue Type: Bug
> Affects Versions: 1.26.0
> Environment: Commons compress 1.26.0 to get a failure. Any tar tgz.
> Reporter: Cosmin Carabet
> Priority: Major
>
> Something in
> [https://github.com/apache/commons-compress/compare/rel/commons-compress-1.25.0...master]
> seems to make iterating through the tar entries of multiple
> TarArchiveInputStreams throw Corrupted TAR archive:
>
> {code:java}
> @Test
> void bla() {
>     ExecutorService executorService = Executors.newFixedThreadPool(10);
>     List<CompletableFuture<Void>> tasks = IntStream.range(0, 200)
>             .mapToObj(_idx -> CompletableFuture.runAsync(
>                     () -> {
>                         try (InputStream inputStream = this.getClass()
>                                         .getResourceAsStream("/<your favourite tar tgz>");
>                                 TarArchiveInputStream tarInputStream =
>                                         new TarArchiveInputStream(new GZIPInputStream(inputStream))) {
>                             TarArchiveEntry tarEntry;
>                             while ((tarEntry = tarInputStream.getNextTarEntry()) != null) {
>                                 System.out.println("Reading entry %s with size %d"
>                                         .formatted(tarEntry.getName(), tarEntry.getSize()));
>                             }
>                         } catch (Exception ex) {
>                             throw new RuntimeException(ex);
>                         }
>                     },
>                     executorService))
>             .toList();
>     Futures.getUnchecked(CompletableFuture.allOf(tasks.toArray(new CompletableFuture<?>[0])));
> } {code}
> Although TarArchiveInputStream is marked as not thread safe, I am not reusing
> objects here. Those are in fact separate objects, presumably all with their
> own position tracking info.
>
> The stacktrace here looks like:
> {code:java}
> Caused by: java.io.IOException: Corrupted TAR archive.
>     at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:1480)
>     at org.apache.commons.compress.archivers.tar.TarArchiveEntry.<init>(TarArchiveEntry.java:534)
>     at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:431)
> Caused by: java.lang.IllegalArgumentException: Invalid byte 100 at offset 0 in 'dddddddddddd' len=12
>     at org.apache.commons.compress.archivers.tar.TarUtils.parseOctal(TarUtils.java:516)
>     at org.apache.commons.compress.archivers.tar.TarUtils.parseOctalOrBinary(TarUtils.java:540)
>     at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeaderUnwrapped(TarArchiveEntry.java:1496)
>     at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:1478)
>     ... 7 more
> {code}
> That stacktrace shows that occasionally the header is wrong (the tar entry
> name contains gibberish bits), which makes me think that
> {{getNextTarEntry()}} can be faulty.
>
> Running that code with Commons Compress 1.25.0 works as expected, so it's
> probably something added since November. Note that this is related to
> parallelism: using an executor service with a single thread doesn't suffer
> from the same error. The tgz to decompress doesn't really matter - you can
> use a manually created one worth a few KBs.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)