[
https://issues.apache.org/jira/browse/AVRO-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799257#comment-13799257
]
Doug Cutting commented on AVRO-1387:
------------------------------------
This could be added as a new feature of the data file format, but that would
require all clients in all languages to be updated before they can process such
files.
Another approach might be to use a wrapper schema, like:
{code}
{"type":"record", "name":"org.apache.avro.file.CheckSummed", "fields":[
{"name":"value", "type":<insert your schema here>}
{"name":"checksum", "type":"long"}
]}
{code}
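Something like the following could build that wrapper programmatically (just a
sketch; the class and method names here are made up, not an existing API):
{code}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ChecksumSchemas {
  /** Wraps a value schema in the CheckSummed record sketched above. */
  public static Schema wrapWithChecksum(Schema valueSchema) {
    return SchemaBuilder.record("CheckSummed")
        .namespace("org.apache.avro.file")
        .fields()
          .name("value").type(valueSchema).noDefault()
          .name("checksum").type().longType().noDefault()
        .endRecord();
  }
}
{code}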
In Java we can provide classes that perform checksumming:
{code}
public class ChecksummedWriter<T> extends DataFileWriter<T> {
  @Override public DataFileWriter<T> create(Schema s, File f) { ... } // wraps the schema with a checksum schema
  @Override public void append(T value) { ... }                       // appends a checksum after each entry
}

public class ChecksummedReader<T> extends DataFileReader<T> {
  @Override public T next() { ... }                                   // validates the checksum
}
{code}
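I haven't specified a checksum algorithm; as one possibility, append() could
serialize the value and take a CRC-32 over the resulting bytes, roughly like
this (just a sketch, assuming CRC-32 is acceptable):
{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class Checksums {
  /** Serializes the datum with its writer schema and returns a CRC-32 of the bytes. */
  public static <T> long checksum(Schema schema, T datum) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<T>(schema).write(datum, encoder);
    encoder.flush();
    CRC32 crc = new CRC32();
    crc.update(out.toByteArray());
    return crc.getValue();
  }
}
{code}
The reader side would recompute the same value over the "value" field's bytes
and compare it against the stored "checksum" field.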
This would permit implementations that know nothing about Avro checksumming to
still access data that has checksums. We could even change the
DataFileReader.openReader() factory to automatically return a ChecksummedReader
when appropriate, making this transparent for Java applications.
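For example, the factory could (as a rough sketch) inspect the file's schema and
switch on the wrapper record's name; the ChecksummedReader constructor used here
is assumed, not existing API:
{code}
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;

public class CheckedReaders {
  /** Returns a ChecksummedReader when the file was written with the wrapper schema. */
  public static <T> DataFileReader<T> open(File file, DatumReader<T> reader) throws IOException {
    DataFileReader<T> plain = new DataFileReader<T>(file, reader);
    Schema schema = plain.getSchema();
    if ("org.apache.avro.file.CheckSummed".equals(schema.getFullName())) {
      plain.close();
      return new ChecksummedReader<T>(file, reader); // assumed constructor
    }
    return plain;
  }
}
{code}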
> Avro container file format update to write checksums for individual record
> --------------------------------------------------------------------------
>
> Key: AVRO-1387
> URL: https://issues.apache.org/jira/browse/AVRO-1387
> Project: Avro
> Issue Type: Bug
> Reporter: Hari Shreedharan
>
> We are considering changes in Flume's file channel to use Avro; one of the
> requirements is that each event (which maps to one Avro record) be
> checksummed so we know if the data is corrupt.
> We'd probably have to add a new version for this, since it will change the
> data format on disk. I can start working on a Java version if there are no
> objections.