[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769302#action_12769302 ]

Doug Cutting commented on AVRO-160:
-----------------------------------

> It is useful to leave open the option for index type metadata in the metadata
> block.

I don't see the use case. Blocks should be small enough that seek time dominates, so scanning should not be a dominant cost. Note that scanning is required anyway when blocks are compressed. Avro scanning should be at least as fast as decompression.

> if we ever want true optimized random access [ ... ]

So long as we're supporting compression, we'll never support seeks directly to individual entries. But blocks are a relatively constant size, so we can support constant-time access to individual entries.

> This allows MapReduce to not have to "seek and scan" but instead find the
> start of the metadata block nearest the HDFS block boundary.

Yes, if we kept a global block index, we could avoid this scan. However since HDFS blocks are ~64MB, and Avro file blocks are ~64k, the scan is less than a tenth of a percent of the overall map cost, so this is perhaps not a worthwhile optimization.

> Maybe, a straightforward thing to do is consider that each block in this file
> has a header, a data block, and a footer.

That could work. We'd also need to terminate blocks with the length of their footer metadata, so that a reader can efficiently find the last footer on open, where, by convention, global data is written.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is to,
> if it is a union, to add new branches at the end of that union. If it is not
> a union, no changes may be made. So it is still the case that the final
> schema in a file can read every entry in the file and thus may be used to
> randomly access the file.
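
The scan-cost estimate in the comment (~64k Avro blocks inside ~64MB HDFS blocks) can be checked with a short calculation. This is only a rough sketch: it assumes a map task scans at most one full Avro block to find its first block boundary, and that cost is proportional to bytes read. Neither assumption is stated in the thread, and the function name `scan_fraction` is made up for illustration.

```python
# Sizes are the approximate figures quoted in the comment, not measured values.
HDFS_BLOCK_BYTES = 64 * 1024 * 1024  # ~64MB HDFS block (assumed to be one map split)
AVRO_BLOCK_BYTES = 64 * 1024         # ~64k Avro file block


def scan_fraction(split_bytes: int, block_bytes: int, blocks_scanned: float = 1.0) -> float:
    """Fraction of a split's bytes read while scanning for the first block boundary.

    Assumption: cost is proportional to bytes read, and at most
    `blocks_scanned` Avro blocks are read before a boundary is found.
    """
    return blocks_scanned * block_bytes / split_bytes


worst_case = scan_fraction(HDFS_BLOCK_BYTES, AVRO_BLOCK_BYTES)
print(f"{worst_case:.4%}")  # 0.0977%
```

Under these assumptions the worst case is 1/1024 of the split, just under a tenth of a percent, which matches the figure given in the comment.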