[ 
https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769302#action_12769302
 ] 

Doug Cutting commented on AVRO-160:
-----------------------------------

> It is useful to leave open the option for index type metadata in the metadata 
> block.

I don't see the use case.  Blocks should be small enough that seek time 
dominates, so scanning should not be a dominant cost.  Note that scanning is 
required anyway when blocks are compressed.  Avro scanning should be at least 
as fast as decompression.

> if we ever want true optimized random access [ ... ]

So long as we're supporting compression, we'll never support seeks directly to 
individual entries.  But blocks are a relatively constant size, so we can 
support constant-time access to individual entries.

> This allows MapReduce to not have to "seek and scan" but instead find the 
> start of the metadata block nearest the HDFS block boundary.

Yes, if we kept a global block index, we could avoid this scan.  However, since 
HDFS blocks are ~64MB and Avro file blocks are ~64KB, the scan is less than a 
tenth of a percent of the overall map cost, so this is perhaps not a worthwhile 
optimization.
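
As a rough check of that figure, here is a back-of-the-envelope calculation. It 
assumes the reader scans at most one full Avro block to find a boundary and 
that map cost scales with bytes read; both are simplifications, and the 
constant names are only illustrative.

```python
# Worst-case scan overhead per HDFS block, under the assumptions above.
HDFS_BLOCK_BYTES = 64 * 1024 * 1024  # ~64MB HDFS block
AVRO_BLOCK_BYTES = 64 * 1024         # ~64KB Avro file block

# Fraction of the HDFS block that is scanned just to find a block boundary.
scan_fraction = AVRO_BLOCK_BYTES / HDFS_BLOCK_BYTES
scan_percent = scan_fraction * 100

print(f"worst-case scan overhead: {scan_percent:.4f}%")
```

The ratio is 1/1024, just under a tenth of a percent.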

> Maybe, a straightforward thing to do is consider that each block in this file 
> has a header, a data block, and a footer.

That could work.  We'd also need to terminate blocks with the length of their 
footer metadata, so that a reader can efficiently find the last footer on open, 
where, by convention, global data is written.
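
A minimal sketch of that idea, assuming a hypothetical layout in which every 
block is followed by a 4-byte big-endian footer length.  The names 
(write_block, read_last_footer) and the trailer encoding are made up for 
illustration and are not the Avro format.  It shows only the lookup from the 
end of the file; a real format would also need lengths or markers for reading 
forward.

```python
import io
import struct

# Hypothetical trailer: 4-byte big-endian length of the preceding footer.
FOOTER_LEN = struct.Struct(">I")


def write_block(out, header: bytes, data: bytes, footer: bytes) -> None:
    """Append header, data, and footer, then the footer's length."""
    out.write(header)
    out.write(data)
    out.write(footer)
    out.write(FOOTER_LEN.pack(len(footer)))


def read_last_footer(f) -> bytes:
    """Locate the final footer with two seeks relative to end of file."""
    # The last bytes of the file hold the length of the last footer.
    f.seek(-FOOTER_LEN.size, io.SEEK_END)
    (footer_len,) = FOOTER_LEN.unpack(f.read(FOOTER_LEN.size))
    # Step back over the trailer and the footer, then read the footer.
    f.seek(-(FOOTER_LEN.size + footer_len), io.SEEK_END)
    return f.read(footer_len)


buf = io.BytesIO()
write_block(buf, b"hdr1", b"records-1", b"meta-1")
write_block(buf, b"hdr2", b"records-2", b"global-metadata")
last_footer = read_last_footer(buf)
```

Because the length sits at a fixed offset from the end, the reader never has 
to scan backward to find where the last footer starts.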

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to 
> the end.
> Currently the interpretation is that schemas written to the file apply to all 
> entries before them.  If this were changed so that they instead apply to all 
> entries that follow, and the initial schema is written at the start of the 
> file, then streaming could be supported.
> Note that the only change permitted to a schema while a file is being written 
> is, if it is a union, to add new branches at the end of that union.  If it is 
> not a union, no changes may be made.  So it is still the case that the final 
> schema in a file can read every entry in the file and thus may be used to 
> randomly access the file.
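
The rule above can be sketched as a small check on parsed JSON schemas, where 
a union is a JSON array.  The function name is_permitted_change is 
hypothetical, and the comparison is plain structural equality rather than 
anything from the Avro libraries.

```python
import json


def is_permitted_change(old_schema, new_schema) -> bool:
    """Return True if new_schema is an allowed mid-file update of old_schema."""
    if isinstance(old_schema, list) and isinstance(new_schema, list):
        # A union may only grow by appending branches at the end, so the
        # old branches must form an unchanged prefix of the new union.
        return new_schema[: len(old_schema)] == old_schema
    # A non-union schema may not change at all.
    return old_schema == new_schema


old = json.loads('["null", "string"]')
ok = is_permitted_change(old, json.loads('["null", "string", "long"]'))
reordered = is_permitted_change(old, json.loads('["string", "null"]'))
changed = is_permitted_change("string", "long")
```

Here ok is True, while reordered and changed are both False.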

