[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768905#action_12768905 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

> You could also store an offset pointer to the schema in every block header,
> instead of the entire thing.

Hmm. Seeks are expensive. If the schema never changed, and folks have to read the file header anyway, then I guess a pointer back to the first schema shouldn't create another seek, if we implement it well. But once you do change the schema, then many seeks in the file would require an extra seek back to read the schema. Caching would help, so maybe it's not a big problem, but, if schemas are small and fast to read, then it shouldn't be bad to put one at the start of each block. So, maybe this works...

> What use cases are you thinking about?

Append is one case where schemas might change: the appending program might differ from that which originally created the file. Other cases are where folks want to lazily add schemas to a file as they write things. Say someone had an event logging system, where different daemons might log different events. You could use a schema that includes all possible events, if you knew them, or you could, on the fly, add new events to the top-level union the first time they're written.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is,
> if it is a union, to add new branches at the end of that union.
> If it is not a union, no changes may be made. So it is still the case that
> the final schema in a file can read every entry in the file and thus may be
> used to randomly access the file.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
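
The rule proposed in the issue description (a schema applies to the entries that follow it, and a union may only grow by appending branches) can be sketched roughly as below. This is only an illustration in plain Python, not Avro's implementation: `is_permitted_change`, `stream_entries`, and the `("schema", ...)` / `("entry", ...)` block tuples are hypothetical names invented here, and a Python list stands in for a union because Avro writes unions as JSON arrays.

```python
def is_permitted_change(old_schema, new_schema):
    """Return True if new_schema may replace old_schema partway through a file.

    A Python list stands for a union (Avro writes unions as JSON arrays);
    any other value stands for a non-union schema.
    """
    if isinstance(old_schema, list) and isinstance(new_schema, list):
        # A union may only grow by appending branches at the end.
        return new_schema[:len(old_schema)] == old_schema
    # A non-union schema may not change at all.
    return old_schema == new_schema


def stream_entries(blocks):
    """Yield (schema, entry) pairs in a single forward pass.

    `blocks` is a hypothetical in-memory stand-in for a file: each item is
    either ("schema", s) or ("entry", e). Each schema applies to the entries
    that follow it, so the reader never has to seek to the end.
    """
    current = None
    for kind, value in blocks:
        if kind == "schema":
            if current is not None and not is_permitted_change(current, value):
                raise ValueError("schema change is not an append to a union")
            current = value
        else:
            if current is None:
                raise ValueError("entry appears before any schema")
            yield current, value


# Hypothetical event log: a second event type is added on first use.
blocks = [
    ("schema", ["LoginEvent"]),
    ("entry", {"type": "LoginEvent", "user": "a"}),
    ("schema", ["LoginEvent", "LogoutEvent"]),
    ("entry", {"type": "LogoutEvent", "user": "a"}),
]
for schema, entry in stream_entries(blocks):
    print(schema, entry)
```

Because every permitted change only appends branches, the last schema seen contains every earlier one as a prefix, which is why the final schema can still read every entry in the file.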