[ https://issues.apache.org/jira/browse/AVRO-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806067#action_12806067 ]
Scott Carey commented on AVRO-380:
----------------------------------

bq. If you set syncInterval to Integer.MAX_VALUE, I bet things will break, because we allocate a ByteArrayOutputStream of 2*Integer.MAX_VALUE, which is going to overflow. We probably will want to allocate ByteArrayOutputStream to MIN( estimated_need, Integer.MAX_VALUE).

Java can't create arrays with length larger than Integer.MAX_VALUE anyway. In order to support uncompressed blocks larger than that, we would need significant changes -- outside the scope of this, and perhaps never needed.

I think we might want to limit syncInterval to 1GB, though. If someone tried to set it to near 2GB, and a value was big enough to take the block from just under that size to over 2GB, it would break.

Since this format requires both the size of the block and the count of items before writing, it is probably not the best format for very large blocks. One would want to stream that volume of data to disk rather than buffer it, which means the count and size are unknown when the block is first written. That can only be done if the count and size are removed, or if they are fixed-size fields and the file supports random writes so that those fields can be written after the block.

bq. I think it's your responsibility here to check that block.length bytes were actually read.

I used LengthLimitedInputStream for that. You could certainly also read into the buffer until you've read the number of bytes you expect to.

bq. I think this also introduces an extra copy for the null codec, but, as you say, there are more things that could be done for performance here.

Yeah, I was trying to not copy too much, but the best way to do that is to complete the buffering performance and API changes to BinaryEncoder/BinaryDecoder. I'll add the extra checks.
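The two points above (clamping the buffer allocation to what a Java array can hold, and reading until exactly block.length bytes have arrived) can be sketched roughly as follows. This is a minimal illustration, not the actual Avro patch; the names safeBufferSize and readFully are hypothetical helpers, not real Avro APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BlockReadSketch {

  /** Clamp an estimated buffer need to MIN(estimate, Integer.MAX_VALUE),
   *  since Java arrays cannot be larger than Integer.MAX_VALUE elements. */
  static int safeBufferSize(long estimatedNeed) {
    return (int) Math.min(estimatedNeed, Integer.MAX_VALUE);
  }

  /** Read exactly len bytes into buf, looping because InputStream.read
   *  may return fewer bytes than requested; fail if the stream ends early. */
  static void readFully(InputStream in, byte[] buf, int off, int len)
      throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total);
      if (n < 0) {
        throw new EOFException("expected " + len + " bytes, got " + total);
      }
      total += n;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] block = new byte[] {1, 2, 3, 4, 5};
    byte[] buf = new byte[safeBufferSize(block.length)];
    readFully(new ByteArrayInputStream(block), buf, 0, block.length);
    System.out.println("read " + block.length + " bytes");
  }
}
```

A LengthLimitedInputStream achieves the same safety from the other direction, by refusing to hand out more than the declared block size; the read-until-full loop above is the complementary check that the block is not short.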
> Avro Container File format change: add block size to block descriptor
> ----------------------------------------------------------------------
>
>                 Key: AVRO-380
>                 URL: https://issues.apache.org/jira/browse/AVRO-380
>             Project: Avro
>          Issue Type: Improvement
>          Components: doc, java, spec
>    Affects Versions: 1.3.0
>            Reporter: Scott Carey
>             Fix For: 1.3.0
>
>         Attachments: AVRO-380.patch
>
>
> The new file format in AVRO-160 limits a few use cases that I have found to be important.
> A block currently contains a count of the number of records, the block data, and a sync marker.
> This change would add the block size, in bytes, alongside the number of records.
> This allows efficient access to a block's data without the need to decode the data into individual Datums, which is useful for various use cases.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.