[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768839#action_12768839 ]
Scott Carey commented on AVRO-160:
----------------------------------

Sounds to me like this file type is trying to be too many things.

Perhaps this type should not be optimized for random access. Perhaps random access is possible but slow, seeking forward or back for a metadata block to find the schema for that block?

A file type optimized for random access would need either embedded indexes or external indexes anyway -- at minimum, indexes to the start of each block. And it has very different "schema visibility and compatibility" requirements.

I believe that if this main file type is optimized for streaming writes and reads, and possibly appending writes and "seek and stream" reads, many challenges are simplified. It will be simpler, easier to test and implement, and meet the majority of use cases.

A format designed with random access in mind can come later. I suspect that due to the "schema visibility" requirements of random access there will be significant differences not just in implementation, but in API.

Additionally, it may be possible for an index file to encapsulate all the random access concerns and use the above format for its data storage. For example, the index over the raw file can be built by one streaming read, and updated with each appending write.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is,
> if it is a union, to add new branches at the end of that union. If it is not
> a union, no changes may be made. So it is still the case that the final
> schema in a file can read every entry in the file and thus may be used to
> randomly access the file.
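
As an illustration of the index idea in the comment above (an index over the raw file built by one streaming read and updated with each appending write), here is a minimal Python sketch. The block layout used here (a 4-byte big-endian length followed by the payload) and the names append_block, build_index, and read_block are assumptions made for this example only; they are not the Avro file format or any Avro API.

```python
import io
import struct

# Hypothetical block header: a 4-byte big-endian payload length.
HEADER = struct.Struct(">I")


def append_block(f, index, payload):
    """Append one block to the data file and record its start offset."""
    f.seek(0, io.SEEK_END)
    index.append(f.tell())
    f.write(HEADER.pack(len(payload)))
    f.write(payload)


def build_index(f):
    """Build the block-offset index in one forward pass.

    Assumes f is positioned at the start of the data; only read() is used,
    so this also works on a non-seekable stream.
    """
    index = []
    offset = 0
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of stream
        (length,) = HEADER.unpack(header)
        index.append(offset)
        f.read(length)  # consume the payload and keep streaming forward
        offset += HEADER.size + length
    return index


def read_block(f, index, i):
    """Random access: seek to block i using the index and return its payload."""
    f.seek(index[i])
    (length,) = HEADER.unpack(f.read(HEADER.size))
    return f.read(length)
```

In this sketch the data file itself stays purely append-and-stream; everything related to random access lives in the separate index list, which could just as well be persisted as its own file.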