[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795624#action_12795624 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

Andrew: sorry, I committed this before I saw your comments.

> I see that SYNC_INTERVAL is a constant. Should be configurable?

Yes.  I've filed AVRO-274 to address this.

> Some kind of flush method that forces writeBlock() would work.

DataFileWriter#flush() and #sync() both force a writeBlock().  The difference
is that #sync() does not also force a flush of the file to disk, and it
returns the position of the sync point, which can be passed to
DataFileReader#seek().

> does it make sense to put the schema in multiple places, like super blocks in
> ext3?

This file format does not attempt to address data integrity issues, instead
trusting that to the filesystem.  Processing a file whose first block is
corrupted would be difficult, not just because of the missing schema but also
because of the missing sync marker.  The sync marker may be recoverable from
EOF if the file is not truncated, but that is difficult to detect with
certainty.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-160-python.patch, AVRO-160.patch, AVRO-160.patch,
> AVRO-160.patch, AVRO-160.patch
>
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them.  If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is,
> if it is a union, to add new branches at the end of that union.  If it is not
> a union, no changes may be made.
> So it is still the case that the final schema in a file can read every
> entry in the file and thus may be used to randomly access the file.
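
The schema-change rule in the quoted description can be sketched in Python.
This is only an illustration under stated assumptions, not code from the
AVRO-160 patches: the function names is_permitted_schema_change and
final_schema_reads_all are hypothetical, and schemas are assumed to be modeled
as parsed JSON, where a union is a list of branches and any other value is a
non-union schema.

```python
def is_permitted_schema_change(old, new):
    """Check one schema change against the rule in the issue description.

    Assumption: a union is a Python list of branches (as in parsed JSON);
    any other value (str, dict) is treated as a non-union schema.
    """
    if isinstance(old, list) and isinstance(new, list):
        # Union: existing branches must stay in place, in the same order;
        # new branches may only be appended at the end.
        return new[:len(old)] == old
    # Non-union (or a union/non-union mix): no change is allowed.
    return old == new


def final_schema_reads_all(history):
    """Return True if every successive schema in `history` is a permitted
    change, in which case the last schema contains every earlier branch."""
    return all(
        is_permitted_schema_change(older, newer)
        for older, newer in zip(history, history[1:])
    )


# Hypothetical sequence of schemas written to one file over time.
history = [
    ["null", "string"],
    ["null", "string", "long"],
]
ok = final_schema_reads_all(history)
```

Because branches are only ever appended, every branch of an earlier schema
appears in the final schema, which is why the final schema can be used for
random access.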