[ 
https://issues.apache.org/jira/browse/AVRO-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799280#comment-13799280
 ] 

Brock Noland commented on AVRO-1387:
------------------------------------

Thanks for the suggestion Doug!  Two things we'd like to do with the new format are:

1) Identify when a portion of the file is corrupt across records
2) Be able to skip past corrupt records

Assuming the corruption happened outside of the Flume record and corrupted 
Avro's own record metadata, would we be able to identify this on read and skip 
ahead until we found a good record? What we'd like to have is something like this:

{noformat}
  Reader in = ..
  Record rec = null;
  do {
    try {
      rec = in.read();
      if (rec != null)
        doSomething(rec);
    } catch (InvalidRecordException ex) {
      // skip ahead to the next sync marker and keep reading
      in.seekNext();
    }
  } while (rec != null);
{noformat}

As our records are fairly large, I believe we plan on writing sync markers 
between records, so I think the seekNext() part should be doable. The thing 
that worries me is how Avro will handle corruption of the Avro event metadata 
itself. If it throws some kind of runtime exception 
(ArrayIndexOutOfBoundsException, NullPointerException, etc.), I wouldn't want 
to "assume" that it was caused by corrupt data.
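
To make the concern concrete, here is a rough, self-contained sketch of the 
behaviour we are hoping for. None of it is Avro's API or Avro's actual file 
layout: the framing (sync marker, length, payload, CRC32), the class name 
ChecksummedRecordSketch, the fixed SYNC value, and the InvalidRecordException 
and seekNext() names are all assumptions made for illustration. The point is 
that the reader validates its own metadata (marker and length) before using it, 
so corruption surfaces as one dedicated exception rather than as an arbitrary 
runtime exception.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical framing, NOT Avro's container format:
// [16-byte sync marker][int length][payload][long CRC32 of payload]
class ChecksummedRecordSketch {
    // Fixed marker for the sketch only; assumed never to occur inside a payload.
    static final byte[] SYNC = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    /** Hypothetical checked exception that signals corruption and nothing else. */
    static class InvalidRecordException extends Exception {
        InvalidRecordException(String msg) { super(msg); }
    }

    static byte[] frame(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(SYNC.length + 4 + payload.length + 8);
        buf.put(SYNC).putInt(payload.length).put(payload).putLong(crc.getValue());
        return buf.array();
    }

    static byte[] writeFile(String... payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String p : payloads) {
            byte[] framed = frame(p.getBytes(StandardCharsets.UTF_8));
            out.write(framed, 0, framed.length);
        }
        return out.toByteArray();
    }

    static class Reader {
        private final ByteBuffer in;
        private int recordStart = 0; // offset of the record most recently attempted

        Reader(byte[] file) { in = ByteBuffer.wrap(file); }

        /** Returns the next payload, or null at end of input. */
        byte[] read() throws InvalidRecordException {
            if (!in.hasRemaining()) return null;
            recordStart = in.position();
            if (in.remaining() < SYNC.length + 4) throw new InvalidRecordException("truncated header");
            byte[] marker = new byte[SYNC.length];
            in.get(marker);
            if (!Arrays.equals(marker, SYNC)) throw new InvalidRecordException("bad sync marker");
            int len = in.getInt();
            // Check the length before trusting it, so corrupt metadata cannot
            // turn into a buffer underflow or a huge allocation.
            if (len < 0 || len > in.remaining() - 8) throw new InvalidRecordException("bad length " + len);
            byte[] payload = new byte[len];
            in.get(payload);
            long expected = in.getLong();
            CRC32 crc = new CRC32();
            crc.update(payload);
            if (crc.getValue() != expected) throw new InvalidRecordException("checksum mismatch");
            return payload;
        }

        /** Moves to the first sync marker after the failed record, or to end of input. */
        void seekNext() {
            int limit = in.limit();
            for (int pos = recordStart + 1; pos + SYNC.length <= limit; pos++) {
                boolean match = true;
                for (int i = 0; i < SYNC.length && match; i++) {
                    match = in.get(pos + i) == SYNC[i];
                }
                if (match) { in.position(pos); return; }
            }
            in.position(limit);
        }
    }

    /** Reads every intact record, skipping corrupt ones. */
    static List<String> readAll(byte[] file) {
        List<String> good = new ArrayList<>();
        Reader in = new Reader(file);
        while (true) {
            try {
                byte[] rec = in.read();
                if (rec == null) break; // end of input
                good.add(new String(rec, StandardCharsets.UTF_8));
            } catch (InvalidRecordException ex) {
                in.seekNext();
            }
        }
        return good;
    }
}
```

In this sketch the loop ends only when read() returns null, so a corrupt record 
at the very start of the file does not stop the scan. Any other exception 
escaping read() would then point to a bug in the reader rather than to bad data.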

> Avro container file format update to write checksums for individual record
> --------------------------------------------------------------------------
>
>                 Key: AVRO-1387
>                 URL: https://issues.apache.org/jira/browse/AVRO-1387
>             Project: Avro
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>
> We are considering changes in Flume's file channel to use Avro. One of the 
> requirements is that each event (which maps to one Avro record) be 
> checksummed so we know if the data is corrupt. 
> We'd probably have to add a new version for this, since this will change the 
> data format on disk. I can start working on a Java version if there are no 
> objections.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
