[ https://issues.apache.org/jira/browse/AVRO-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Douglas Creager updated AVRO-818:
---------------------------------
Attachment: 0001-AVRO-818.-Fix-data-file-corruption-bug-in-C-library.patch
Here's a fix. We now keep track of the size of the in-memory buffer after the
most recent successfully serialized record, and use this as the block size when
we write a block to disk. This ensures that even if there is an incomplete
record at the end of the memory buffer, we don't include it in the block.
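In outline, the approach looks something like the sketch below. This is not
the actual patch; BLOCK_SIZE, serialize_record, write_block, and
append_record are all illustrative stand-ins for the real avro_writer_t code
in datafile.c, just to show the checkpoint-and-flush idea.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* illustrative; not the library's actual block size */

static char   block[BLOCK_SIZE];
static size_t used = 0;           /* bytes written, possibly incl. a partial record */
static size_t last_complete = 0;  /* bytes up to the last fully serialized record */

/* Toy serializer standing in for the Avro encoder.  Like the real
 * encoder, it is not atomic: on overflow it leaves partial bytes in
 * the buffer before reporting failure. */
static int serialize_record(const char *rec, size_t len,
                            char *buf, size_t cap, size_t *pos)
{
    size_t room = cap - *pos;
    size_t take = len < room ? len : room;
    memcpy(buf + *pos, rec, take);    /* the partial write happens here */
    *pos += take;
    return take == len ? 0 : -1;
}

/* Stand-in for emitting one data-file block. */
static int write_block(FILE *fp, const char *buf, size_t len)
{
    return fwrite(buf, 1, len, fp) == len ? 0 : -1;
}

int append_record(FILE *fp, const char *rec, size_t len)
{
    if (serialize_record(rec, len, block, BLOCK_SIZE, &used) == 0) {
        last_complete = used;     /* checkpoint after each complete record */
        return 0;
    }

    /* Buffer full: flush only the checkpointed bytes, so the trailing
     * partial record never reaches the file. */
    if (write_block(fp, block, last_complete) != 0)
        return -1;
    used = 0;
    last_complete = 0;

    /* Retry the record from scratch in the now-empty buffer. */
    if (serialize_record(rec, len, block, BLOCK_SIZE, &used) != 0)
        return -1;                /* record larger than a whole block */
    last_complete = used;
    return 0;
}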
> C data file writer produces corrupt blocks
> ------------------------------------------
>
> Key: AVRO-818
> URL: https://issues.apache.org/jira/browse/AVRO-818
> Project: Avro
> Issue Type: Bug
> Components: c
> Affects Versions: 1.5.1
> Reporter: Douglas Creager
> Assignee: Douglas Creager
> Attachments:
> 0001-AVRO-818.-Fix-data-file-corruption-bug-in-C-library.patch, quickstop.c
>
>
> The data file writer in the C library can produce corrupt blocks. The logic
> in datafile.c is that we have an avro_writer_t instance backed by a
> fixed-size in-memory buffer. When you append records to the data file, they
> go into this memory buffer. If we get an error serializing into the memory
> buffer, it's presumably because we've filled it, so we write out the memory
> buffer's contents as a new block in the file, clear the buffer, and try
> again.
> The problem is that the failed serialization into the memory buffer isn't
> atomic: some of the record's bytes will have made it into the buffer before
> we discover that there's not enough room, and this incomplete record then
> makes it into the file.
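For contrast, a minimal sketch of the pre-patch behaviour, reusing the
illustrative stand-ins from the sketch above (again, not the actual
datafile.c code): the flush uses 'used', which still includes the partial
bytes of the record that failed to serialize.

/* Pre-patch behaviour (buggy): the whole buffer is written as a block,
 * so the partial bytes of the failed record end up on disk. */
int append_record_buggy(FILE *fp, const char *rec, size_t len)
{
    if (serialize_record(rec, len, block, BLOCK_SIZE, &used) == 0)
        return 0;

    if (write_block(fp, block, used) != 0)   /* partial bytes included */
        return -1;
    used = 0;

    return serialize_record(rec, len, block, BLOCK_SIZE, &used);
}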