[ 
https://issues.apache.org/jira/browse/AVRO-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896944#action_12896944
 ] 

Scott Carey commented on AVRO-541:
----------------------------------

OK, this one is getting interesting.  I wish it weren't so interesting...

So, here are some perplexing details noticed while stepping through in the 
debugger:
* Concatenating from gzip-6 >> null uncompresses the gzip blocks and writes 
uncompressed blocks.  This is where we see the bug.
* Taking the exact same gzip-6 file, concatenating it to another gzip'ed file, 
and forcing compression causes the same code to execute on the front side as 
expected: uncompress the blocks.
* In the first case above, the data sometimes comes out of a block corrupted.  
The first 'chunk' from java.util.zip.Inflater is good, then it is junk after 
that.  This junk is always at the end of our block as the first chunk from 
Inflater is slightly smaller than our block size.
* In the second case, the data does not come out of the block corrupted!  That 
is, sometimes Inflater.java uncompresses the same block fine, and sometimes it 
does not.
* If you change the avro file block size (syncInterval) you get different 
results.  A different block size will cause _different_ blocks of the file to 
be corrupt, or none of them.  So it seems that the random seed's influence on 
reproducing the bug comes not from a data block's contents, but from its size 
in relation to the block size.
* If you change the schema from (string, long) to (long, string) you get 
different errors -- mostly hard exceptions rather than validation errors.
* The issue always coincides with the InflaterInputStream throwing an 
apparently spurious exception instead of returning '-1' at the end of the 
stream (spurious because the same data file can sometimes be uncompressed 
successfully).
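To illustrate the contract being violated: a valid DEFLATE stream read through InflaterInputStream should end with read() returning -1, never an exception.  This is a minimal stand-alone sketch (class and method names are my own, not Avro code) that round-trips a buffer through the same java.util.zip classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class InflateToEof {
    public static byte[] roundTrip(byte[] data) throws IOException {
        // Compress with DEFLATE at level 6, matching the gzip-6 codec setting.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        DeflaterOutputStream dos =
            new DeflaterOutputStream(compressed, new Deflater(6));
        dos.write(data);
        dos.finish();

        // Decompress by reading until read() returns -1; on valid input a
        // correct stream must reach end-of-stream rather than throw.
        InflaterInputStream iis = new InflaterInputStream(
            new ByteArrayInputStream(compressed.toByteArray()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];  // deliberately smaller than a data block
        int n;
        while ((n = iis.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[4096];
        Arrays.fill(data, (byte) 'a');
        System.out.println(Arrays.equals(data, roundTrip(data)));
    }
}
```

A reduced test case would do the same round trip with one of the blocks captured from the failing test run.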


The good news:
Even with the JUnit assertEquals() commented out, the test still fails because 
Avro detects that the block is corrupted, or some other error occurs.  So most 
likely any user running into this in the real world would not see silent data 
corruption.

I currently use this feature quite a bit, but only concatenating from gzip to 
gzip without uncompressing the blocks (this way I can concatenate faster than 
the disks can keep up -- merging thousands of smaller files into larger ones).

I should have more time to look into this later today.  Some next steps:
- Produce a reduced test case where decompression fails, side by side with one 
where it works on the same file.  Hopefully this will help pinpoint a bug 
either in our use of InflaterInputStream or in Inflater itself.

Possible work-arounds and code cleanup:
- Use Inflater directly instead of InflaterInputStream to reduce the layers 
between gzip compression and the JNI code in Inflater.java.
- Refactor DataFileStream and DataFileWriter to use the same codepath for block 
decompression.
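On the first work-around: calling Inflater directly is straightforward when the uncompressed size of a block is known up front, as it is for an Avro data block.  A rough sketch of what that could look like (names are illustrative only, not the actual patch):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DirectInflate {

    // Compress a buffer with zlib/DEFLATE at level 6, mirroring the codec side.
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater(6);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a block of known uncompressed size with Inflater directly,
    // with no InflaterInputStream layer in between.
    public static byte[] inflateBlock(byte[] compressed, int uncompressedSize)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[uncompressedSize];
        int off = 0;
        while (off < out.length && !inflater.finished()) {
            int n = inflater.inflate(out, off, out.length - off);
            if (n == 0 && inflater.needsInput()) {
                // Ran out of input before filling the block: truncated data.
                throw new DataFormatException("truncated or corrupt block");
            }
            off += n;
        }
        inflater.end();
        return out;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] data = new byte[8192];
        Arrays.fill(data, (byte) 'a');
        System.out.println(
            Arrays.equals(data, inflateBlock(deflate(data), data.length)));
    }
}
```

Since the block header carries the sizes, this also lets us fail fast with a clear error when the inflated byte count does not match, instead of relying on the stream layer's end-of-stream behavior.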



> Java: TestDataFileConcat sometimes fails
> ----------------------------------------
>
>                 Key: AVRO-541
>                 URL: https://issues.apache.org/jira/browse/AVRO-541
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Scott Carey
>            Priority: Critical
>             Fix For: 1.4.0
>
>         Attachments: AVRO-541.patch
>
>
> TestDataFileConcat intermittently fails with:
> {code}
> Testcase: testConcateateFiles[5] took 0.032 sec
>         Caused an ERROR
> java.io.IOException: Block read partially, the data may be corrupt
> org.apache.avro.AvroRuntimeException: java.io.IOException: Block read 
> partially, the data may be corrupt
>         at 
> org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:173)
>         at org.apache.avro.file.DataFileStream.next(DataFileStream.java:193)
>         at 
> org.apache.avro.TestDataFileConcat.testConcateateFiles(TestDataFileConcat.java:141)
> Caused by: java.io.IOException: Block read partially, the data may be corrupt
>         at 
> org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:157)
> {code}

-- 
This message is automatically generated by JIRA.