[ https://issues.apache.org/jira/browse/AVRO-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806067#action_12806067 ]
Scott Carey commented on AVRO-380:
----------------------------------

bq. If you set syncInterval to Integer.MAX_VALUE, I bet things will break, because we allocate a ByteArrayOutputStream of 2*Integer.MAX_VALUE, which is going to overflow. We probably will want to allocate ByteArrayOutputStream to MIN( estimated_need, Integer.MAX_VALUE).

Java can't create arrays with length larger than Integer.MAX_VALUE anyway. In order to support uncompressed blocks larger than that, we would need significant changes -- outside the scope of this, and perhaps never needed.

I think we might want to limit syncInterval to 1GB, though. If someone tried to set it to near 2GB, and a value was big enough to take the block from just under that size to over 2GB, it would break.

Since this format requires both the size of the block and the count of items before writing, it is probably not the best format for very large blocks. One would want to stream that volume of data to disk rather than buffer it, which means the count and size are unknown when the block is first written. That can only be done if the count and size are removed, or if they are fixed-size fields and the file supports random writes so that those fields can be written after the block.

bq. I think it's your responsibility here to check that block.length bytes were actually read.

I used LengthLimitedInputStream for that. You could certainly also read into the buffer until you've read the number of bytes you expect to.

bq. I think this also introduces an extra copy for the null codec, but, as you say, there are more things that could be done for performance here.

Yeah, I was trying to not copy too much, but the best way to do that is to complete the buffering performance and API changes to BinaryEncoder/BinaryDecoder. I'll add the extra checks.
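The two points above (clamping the buffer allocation to what a Java array can hold, and reading until exactly block.length bytes have arrived) can be sketched roughly as follows. This is a minimal illustration, not the actual Avro patch; the names safeBufferSize and readFully are hypothetical helpers, not real Avro APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BlockReadSketch {

  /** Clamp an estimated buffer need to MIN(estimate, Integer.MAX_VALUE),
   *  since Java arrays cannot be larger than Integer.MAX_VALUE elements. */
  static int safeBufferSize(long estimatedNeed) {
    return (int) Math.min(estimatedNeed, Integer.MAX_VALUE);
  }

  /** Read exactly len bytes into buf, looping because InputStream.read
   *  may return fewer bytes than requested; fail if the stream ends early. */
  static void readFully(InputStream in, byte[] buf, int off, int len)
      throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total);
      if (n < 0) {
        throw new EOFException("expected " + len + " bytes, got " + total);
      }
      total += n;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] block = new byte[] {1, 2, 3, 4, 5};
    byte[] buf = new byte[safeBufferSize(block.length)];
    readFully(new ByteArrayInputStream(block), buf, 0, block.length);
    System.out.println("read " + block.length + " bytes");
  }
}
```

A LengthLimitedInputStream achieves the same safety from the other direction, by refusing to hand out more than the declared block size; the read-until-full loop above is the complementary check that the block is not short.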
> Avro Container File format change: add block size to block descriptor
> ----------------------------------------------------------------------
>
>                 Key: AVRO-380
>                 URL: https://issues.apache.org/jira/browse/AVRO-380
>             Project: Avro
>          Issue Type: Improvement
>          Components: doc, java, spec
>    Affects Versions: 1.3.0
>            Reporter: Scott Carey
>             Fix For: 1.3.0
>
>         Attachments: AVRO-380.patch
>
>
> The new file format in AVRO-160 limits a few use cases that I have found to be important.
> A block currently contains a count of the number of records, the block data, and a sync marker.
> This change would add the block size, in bytes, alongside the number of records.
> This allows efficient access to a block's data without the need to decode the data into individual Datums, which is useful for various use cases.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.