[
https://issues.apache.org/jira/browse/HADOOP-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630551#comment-16630551
]
Vinayakumar B commented on HADOOP-15196:
----------------------------------------
Thanks for the fix [~brahmareddy].
1. The patch fixes the reported issue except in one case: when
{{BuiltInGzipDecompressor}} is used and the trailing garbage is shorter than
10 bytes. The change below to
{{BuiltInGzipDecompressor#executeHeaderState()}} handles that case as well.
{code:java}
@@ -253,8 +266,11 @@ private void executeHeaderState() throws IOException {
if (state == GzipStateLabel.HEADER_BASIC) {
int n = Math.min(userBufLen, 10-localBufOff); // (or 10-headerBytesRead)
checkAndCopyBytesToLocal(n); // modifies userBufLen, etc.
- if (localBufOff >= 10) { // should be strictly ==
+ if (localBufOff > 0) { // should be strictly ==
processBasicHeader(); // sig, compression method, flagbits
+ if (ignoreTrailingGarbage) {
+ return;
+ }
localBufOff = 0; // no further need for basic header
state = GzipStateLabel.HEADER_EXTRA_FIELD;
}
{code}
2. Reset the {{newStream}} and {{ignoreTrailingGarbage}} flags if the
concatenated stream has valid bytes.
This can be done in {{BuiltInGzipDecompressor#decompress()}} as below.
{code:java}
@@ -208,6 +216,11 @@ public synchronized int decompress(byte[] b, int off, int len)
} catch (DataFormatException dfe) {
throw new IOException(dfe.getMessage());
}
+ if (newSteam) {
+ //Reset if new stream have valid bytes
+ newSteam = false;
+ ignoreTrailingGarbage = false;
+ }
crc.update(b, off, numAvailBytes); // CRC-32 is on _uncompressed_ data
if (inflater.finished()) {
state = GzipStateLabel.TRAILER_CRC;
{code}
3. A test needs to be added to verify this, covering both the native and
non-native decompressors.
Creating a gzip file with trailing garbage is easy: create a gzip-compressed
file and append some extra bytes to it directly.
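As a rough illustration of how such test input could be built, here is a minimal sketch that uses only {{java.util.zip}}. The class name {{TrailingGarbageGzip}} and the helper {{gzipWithTrailingGarbage}} are hypothetical and not part of the patch. A real test would feed the resulting bytes to the Hadoop codec under test, which this sketch does not do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for building test input; not part of the Hadoop patch.
public class TrailingGarbageGzip {

    /** Gzip-compresses payload, then appends garbage after the gzip trailer. */
    static byte[] gzipWithTrailingGarbage(byte[] payload, byte[] garbage)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the CRC-32/ISIZE trailer.
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        }
        // Closing a ByteArrayOutputStream has no effect, so we can keep writing.
        out.write(garbage);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "some test data".getBytes(StandardCharsets.UTF_8);

        // Fewer than 10 bytes: shorter than the basic gzip header.
        byte[] shortGarbage = {'x', 'y', 'z'};
        // 10 bytes or more: long enough to be read as a full basic header.
        byte[] longGarbage = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);

        byte[] shortCase = gzipWithTrailingGarbage(payload, shortGarbage);
        byte[] longCase = gzipWithTrailingGarbage(payload, longGarbage);

        System.out.println("short-garbage input: " + shortCase.length + " bytes");
        System.out.println("long-garbage input:  " + longCase.length + " bytes");
    }
}
```

The two garbage lengths are chosen to cover both sides of the 10-byte boundary discussed in point 1. The same byte arrays could be passed to the native and the non-native decompressor.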
> Zlib decompression fails when file having trailing garbage
> ----------------------------------------------------------
>
> Key: HADOOP-15196
> URL: https://issues.apache.org/jira/browse/HADOOP-15196
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Priority: Major
> Attachments: HADOOP-15196.patch
>
>
> *When a file has trailing garbage, gzip ignores it.*
> {noformat}
> gzip -d 2018011309-js.rishenglipin.com.gz
> gzip: 2018011309-js.rishenglipin.com.gz: decompression OK, trailing garbage
> ignored
> {noformat}
> *When we decompress the same file, we get the following.*
> {noformat}
> 2018-01-13 14:23:43,151 | WARN | task-result-getter-3 | Lost task 0.0 in
> stage 345.0 (TID 5686, node-core-gyVYT, executor 3): java.io.IOException:
> unknown compression method
> at
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
> Method)
> at
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:225)
> at
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
> at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> {noformat}