[
https://issues.apache.org/jira/browse/AVRO-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896944#action_12896944
]
Scott Carey commented on AVRO-541:
----------------------------------
OK, this one is getting interesting. I wish it wasn't so interesting...
So, here are some perplexing details I noticed while stepping through in the
debugger:
* Concatenating from gzip-6 >> null uncompresses the gzip blocks and writes
uncompressed blocks. This is where we are seeing the bug.
* Taking the exact same gzip-6 file, concatenating it onto another gzip'ed
file, and telling it to force compression causes the same code to execute on
the front side as expected: the blocks are uncompressed.
* In the first case above, the data sometimes comes out of a block corrupted.
The first 'chunk' from java.util.zip.Inflater is good, but everything after it
is junk. This junk is always at the end of our block, since the first chunk
from Inflater is slightly smaller than our block size.
* In the second case, the data does not come out of the block corrupted! That
is, Inflater.java sometimes uncompresses the same block fine, and sometimes it
does not.
* If you change the Avro file block size (syncInterval) you get different
results. A different block size causes _different_ blocks of the file to be
corrupt, or none of them. So it seems that the random seed's influence on
reproducing the bug comes not from a data block's contents, but from its size
in relation to the block size.
* If you change the schema from (string, long) to (long, string) you get
different errors -- mostly hard exceptions rather than validation errors.
* The issue always coincides with InflaterInputStream failing to return '-1'
at the end of the stream, instead throwing an apparently spurious exception
(spurious because the same data file can sometimes be uncompressed
successfully). See the sketch after this list.
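For reference, here is roughly the pattern in question, as a standalone sketch (my own code, not Avro's). It deflates a buffer at level 6 with nowrap set, which I believe matches our deflate codec, then drains it through InflaterInputStream; a clean end of stream is read() returning -1 rather than an exception:
{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class InflateLoop {
  public static void main(String[] args) throws IOException {
    // Build a compressible buffer and deflate it at level 6,
    // nowrap=true (raw deflate).
    byte[] raw = new byte[70000];
    for (int i = 0; i < raw.length; i++) raw[i] = (byte) (i % 251);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DeflaterOutputStream dos =
        new DeflaterOutputStream(baos, new Deflater(6, true));
    dos.write(raw);
    dos.close();

    // Drain the stream. The failure described above is an exception
    // from read() instead of a -1 return at end of stream.
    InflaterInputStream in = new InflaterInputStream(
        new ByteArrayInputStream(baos.toByteArray()), new Inflater(true));
    byte[] buf = new byte[8192];
    int total = 0;
    for (int n = in.read(buf); n != -1; n = in.read(buf)) {
      total += n;
    }
    System.out.println("decompressed " + total + " bytes");
  }
}
{code}
One detail that may matter here: the Inflater(boolean nowrap) javadoc warns that with nowrap an extra 'dummy' input byte is needed for zlib to report end of stream, and InflaterInputStream has no extra byte to hand over once the underlying stream is exhausted.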
The good news:
When you comment out the JUnit assertEquals(), the test still fails because
Avro detects that the block is corrupted, or some other error occurs. So most
likely any user running into this in the real world would not have silent data
corruption.
I currently use this feature quite a bit, but only to concatenate from gzip to
gzip without uncompressing the blocks (this way I can concatenate thousands of
smaller files into larger ones faster than the disks can keep up).
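For context, the usage I mean is roughly the following sketch (from memory, not the exact code). It assumes the appendAllFrom(stream, recompress) method on DataFileWriter, with recompress=false so blocks are copied over still compressed, and it requires the schema and codec to match across inputs:
{code}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ConcatFiles {
  // Concatenate many small files (same schema and codec) into one
  // larger file without decompressing their blocks.
  static void concat(Schema schema, List<File> inputs, File output)
      throws IOException {
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>());
    writer.setCodec(CodecFactory.deflateCodec(6));
    writer.create(schema, output);
    for (File f : inputs) {
      DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
          new FileInputStream(f), new GenericDatumReader<GenericRecord>());
      writer.appendAllFrom(in, false); // false = do not recompress blocks
      in.close();
    }
    writer.close();
  }
}
{code}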
I should have more time to look into this later today. Some next steps:
Produce a reduced test case where decompression fails, side by side with one
where it works on the same file. Hopefully this will help pinpoint a bug
either in our use of InflaterInputStream or in Inflater itself.
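Something along these lines, where deflate, inflateDirect, and inflateViaStream are hypothetical helpers standing in for the two decompression paths (JUnit 4; the interesting buffer sizes are the ones near the Avro block size):
{code}
// Sketch of the reduced case: one buffer, compressed once at level 6,
// decompressed along both paths. The helpers are placeholders.
@Test
public void sameBlockBothPaths() throws Exception {
  byte[] raw = randomBytes(64 * 1024 + 37);   // size near the block size
  byte[] compressed = deflate(raw, 6);
  assertArrayEquals(raw, inflateDirect(compressed));    // Inflater path
  assertArrayEquals(raw, inflateViaStream(compressed)); // InflaterInputStream path
}
{code}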
Possible work-arounds and code cleanup:
- Use Inflater directly instead of InflaterInputStream to reduce the layers
between gzip compression and the JNI code in Inflater.java (a sketch follows
this list).
- Refactor DataFileStream and DataFileWriter to use the same codepath for block
decompression.
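For the first item, here is a minimal sketch of what using Inflater directly could look like (my guess at the shape, not a patch). It also applies the caveat from the Inflater(boolean nowrap) javadoc, that raw deflate input may need one extra 'dummy' byte for zlib to report end of stream:
{code}
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class DirectInflate {
  // Decompress one raw-deflate block with Inflater directly, with no
  // InflaterInputStream layered in between.
  static byte[] inflateBlock(byte[] compressed) throws DataFormatException {
    Inflater inflater = new Inflater(true); // nowrap = raw deflate
    // Per the Inflater(nowrap) javadoc, append a dummy byte so zlib
    // can report end of stream.
    inflater.setInput(Arrays.copyOf(compressed, compressed.length + 1));
    ByteArrayOutputStream out =
        new ByteArrayOutputStream(compressed.length * 4);
    byte[] buf = new byte[8192];
    while (!inflater.finished()) {
      int n = inflater.inflate(buf);
      if (n > 0) {
        out.write(buf, 0, n);
      } else if (inflater.needsInput() || inflater.needsDictionary()) {
        throw new DataFormatException("incomplete deflate block");
      }
    }
    inflater.end();
    return out.toByteArray();
  }
}
{code}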
> Java: TestDataFileConcat sometimes fails
> ----------------------------------------
>
> Key: AVRO-541
> URL: https://issues.apache.org/jira/browse/AVRO-541
> Project: Avro
> Issue Type: Bug
> Components: java
> Reporter: Doug Cutting
> Assignee: Scott Carey
> Priority: Critical
> Fix For: 1.4.0
>
> Attachments: AVRO-541.patch
>
>
> TestDataFileConcat intermittently fails with:
> {code}
> Testcase: testConcateateFiles[5] took 0.032 sec
>     Caused an ERROR
> java.io.IOException: Block read partially, the data may be corrupt
> org.apache.avro.AvroRuntimeException: java.io.IOException: Block read partially, the data may be corrupt
>     at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:173)
>     at org.apache.avro.file.DataFileStream.next(DataFileStream.java:193)
>     at org.apache.avro.TestDataFileConcat.testConcateateFiles(TestDataFileConcat.java:141)
> Caused by: java.io.IOException: Block read partially, the data may be corrupt
>     at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:157)
> {code}