[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768871#action_12768871 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

> Perhaps this type should not be optimized for random access.

For mapreduce, we need to be able to seek to an arbitrary point in the file, then scan to the next sync point and start reading the file. That's mostly what I mean by random access.

It should also be possible to layer indexes on top of this, to support random access by key. Indexes might be stored as side files, or perhaps in the file's metadata. To support these, it should be possible to ask, while writing, the position of the current block start, so that one may store that in an index and subsequently seek to it, then scan the block for the desired entry.

Let me elaborate on my last proposal. We put a schema at the start of every block. Every entry in a block must use the same schema. If you change the schema while writing, then you must start writing a new block. In effect, the schema is a compression dictionary for the block. (Blocks are also the unit of compression.)

Benefits:
 - supports streaming
 - supports random access
 - permits arbitrary schema changes

Costs:
 - increases the file size, but this can be ameliorated by:
 -- writing the schema in binary (using a schema for schemas) and/or
 -- writing larger blocks

I think it still may make sense to flush metadata at the end of the file. It may no longer contain the schema, but it can contain things like counts and indexes. Streaming applications would not be able to use this, but other applications might find it very useful. Side files in HDFS are expensive.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to the end.
>
> Currently the interpretation is that schemas written to the file apply to all entries before them. If this were changed so that they instead apply to all entries that follow, and the initial schema is written at the start of the file, then streaming could be supported.
>
> Note that the only change permitted to a schema as a file is written is, if it is a union, to add new branches at the end of that union. If it is not a union, no changes may be made. So it is still the case that the final schema in a file can read every entry in the file and thus may be used to randomly access the file.
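
For illustration, here is a rough Python sketch of the per-block schema idea from the comment above. It is not Avro's actual file format. The byte layout (sync marker, length-prefixed JSON schema, length-prefixed zlib payload), the JSON encoding of entries, and the names `BlockWriter` and `read_blocks` are all assumptions made up for this example. It shows the three things the comment asks for: a schema change forces a new block, the writer can report the start position of the block holding an entry, and a reader can seek to an arbitrary offset and scan forward to the next sync point.

```python
import io
import json
import os
import struct
import zlib

SYNC_SIZE = 16  # hypothetical marker length for this toy format


class BlockWriter:
    """Toy writer: every block is sync | schema | compressed entries."""

    def __init__(self, out, sync=None):
        self.out = out
        self.sync = sync or os.urandom(SYNC_SIZE)
        self.schema = None
        self.entries = []

    def append(self, schema, entry):
        """Buffer an entry; return the file offset of the block that will hold it."""
        if self.entries and schema != self.schema:
            self.flush()  # a schema change starts a new block
        self.schema = schema
        self.entries.append(entry)
        return self.out.tell()  # nothing of the current block is written yet

    def flush(self):
        """Write the buffered entries as one block."""
        if not self.entries:
            return
        schema_bytes = json.dumps(self.schema).encode()
        lines = "\n".join(json.dumps(e) for e in self.entries)
        payload = zlib.compress(lines.encode())  # the block is the unit of compression
        self.out.write(self.sync)
        self.out.write(struct.pack(">I", len(schema_bytes)) + schema_bytes)
        self.out.write(struct.pack(">I", len(payload)) + payload)
        self.entries = []


def read_blocks(data_file, offset=0):
    """Seek to offset, scan to the next sync marker, yield (pos, schema, entries)."""
    data_file.seek(0)
    sync = data_file.read(SYNC_SIZE)  # the first block's marker doubles as the header
    if len(sync) < SYNC_SIZE:
        return
    data_file.seek(offset)
    buf = data_file.read()  # toy simplification: read the remainder into memory
    i = buf.find(sync)  # a chance match inside a payload is ignored in this sketch
    while i != -1:
        p = i + SYNC_SIZE
        (n,) = struct.unpack_from(">I", buf, p)
        p += 4
        schema = json.loads(buf[p:p + n])
        p += n
        (m,) = struct.unpack_from(">I", buf, p)
        p += 4
        text = zlib.decompress(buf[p:p + m]).decode()
        p += m
        yield offset + i, schema, [json.loads(line) for line in text.split("\n")]
        i = p if buf.startswith(sync, p) else -1


if __name__ == "__main__":
    f = io.BytesIO()
    w = BlockWriter(f)
    for n in range(3):
        w.append("long", n)
    pos = w.append("string", "a")  # different schema, so a new block begins at pos
    w.flush()
    index = {"a": pos}  # a key index could live in a side file or trailing metadata
    print(next(read_blocks(f, index["a"])))
```

A streaming reader only ever calls `read_blocks(f)` from offset 0 and never needs the end of the file. A mapreduce-style reader can pass any split offset and pick up at the next block, because each block carries the schema needed to decode it.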