[ https://issues.apache.org/jira/browse/COMPRESS-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909739#comment-16909739 ]

Stefan Bodewig commented on COMPRESS-376:
-----------------------------------------

[~jgustie] have you ever found time to give the branch a try? I'm looking 
through old branches to see whether they are still needed or can be 
removed.

> decompressConcatenated improvement
> ----------------------------------
>
>                 Key: COMPRESS-376
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-376
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Compressors
>            Reporter: Jeremy Gustie
>            Priority: Major
>
> First, the problem I am seeing: I always set {{decompressConcatenated}} to 
> {{true}}, and most of the time this works fine. However, it seems that some 
> versions of Python's tarfile module pad a compressed TAR file with null 
> bytes. The null bytes are recognized as garbage, causing decompression to 
> fail. Unfortunately, this failure occurs while filling a buffer with data 
> used to read the final entry in the TAR file, causing 
> {{TarArchiveInputStream.getNextEntry}} to fail before the last entry can be 
> returned.
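> For illustration, a minimal sketch of the failing scenario; the file name 
> {{padded.tar.gz}} is hypothetical, standing in for a gzipped TAR that 
> tarfile padded with trailing null bytes:
> {code:java}
> import java.io.BufferedInputStream;
> import java.io.InputStream;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import org.apache.commons.compress.archivers.ArchiveEntry;
> import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
> import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
>
> public class PaddedTarGzRepro {
>     public static void main(final String[] args) throws Exception {
>         try (InputStream in = new BufferedInputStream(
>                      Files.newInputStream(Paths.get("padded.tar.gz")));
>              // the second argument is decompressConcatenated
>              GzipCompressorInputStream gz =
>                      new GzipCompressorInputStream(in, true);
>              TarArchiveInputStream tar = new TarArchiveInputStream(gz)) {
>             ArchiveEntry entry;
>             // With trailing null padding, this loop throws
>             // "Garbage after a valid .gz stream" before the final
>             // entry is returned.
>             while ((entry = tar.getNextEntry()) != null) {
>                 System.out.println(entry.getName());
>             }
>         }
>     }
> }
> {code}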
> There are a couple of potential solutions I can see:
> 1. The easiest thing to do would be to special-case the null padding and 
> simply terminate without failing (in the {{GzipCompressorInputStream.init}} 
> method, this amounts to adding a check for {{magic0 == 0 && (magic1 == 0 || 
> magic1 == -1)}} and returning {{false}}; a sketch follows this list). 
> Perhaps draining the underlying stream to verify that the remaining bytes 
> are all null could reduce the likelihood of a false positive when 
> recognizing the padding.
> 2. Change {{decompressConcatenated}} to a tri-state value (or perhaps add 
> an extra {{ignoreGarbage}} flag) to suppress the failure; basically, 
> concatenated streams would be decompressed only if the appropriate magic is 
> found (a possible API shape is sketched at the end of this description). 
> This has API impact but completely preserves backwards compatibility.
> 3. Finally, deferring the failure to the next read attempt may also be a 
> viable solution that nearly preserves backwards compatibility. As mentioned 
> above, the "Garbage after..." error occurs while reading the final entry in 
> a TAR file: if the current read (which contains all of the final data from 
> the compression stream) were allowed to complete normally, the downstream 
> consumer might also complete normally; the next attempt to read (past the 
> end of the compression stream, into the garbage) would then be the read 
> that fails with the "Garbage after..." error. This gives the downstream 
> code the best opportunity both to process the full compression stream and 
> to receive the unexpected-garbage failure.
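> To make option 1 concrete, a rough sketch of the check at the top of 
> {{GzipCompressorInputStream.init}} (the surrounding code is paraphrased, 
> not the actual source):
> {code:java}
> private boolean init(final boolean isFirstMember) throws IOException {
>     final int magic0 = in.read();
>     // End of input after at least one member: normal end of stream.
>     if (magic0 == -1 && !isFirstMember) {
>         return false;
>     }
>     final int magic1 = in.read();
>
>     // Proposed special case: treat trailing null padding (as written
>     // by some Python tarfile versions) as end of stream rather than
>     // as garbage.
>     if (!isFirstMember && magic0 == 0 && (magic1 == 0 || magic1 == -1)) {
>         return false;
>     }
>
>     if (magic0 != 31 || magic1 != 139) {
>         throw new IOException(isFirstMember
>                 ? "Input is not in the .gz format"
>                 : "Garbage after a valid .gz stream");
>     }
>     // ... continue parsing the gzip member header as before ...
>     return true;
> }
> {code}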
> I was mostly looking at {{GzipCompressorInputStream}}; I suspect similar 
> changes would be needed in the other decompress-concatenated compressor 
> streams.
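> To make option 2 concrete as well, one possible (purely illustrative) 
> shape for the tri-state API:
> {code:java}
> // Hypothetical policy enum; the names are illustrative only, not a
> // proposal for the final API.
> public enum ConcatenationMode {
>     SINGLE_MEMBER,   // today's decompressConcatenated = false
>     CONCATENATED,    // today's decompressConcatenated = true
>     LENIENT          // decompress concatenated members, but stop
>                      // quietly at trailing bytes that lack the gzip
>                      // magic instead of throwing
> }
>
> // A constructor overload could then sit alongside the boolean ones:
> // new GzipCompressorInputStream(in, ConcatenationMode.LENIENT);
> {code}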



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
