[ https://issues.apache.org/jira/browse/AVRO-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992661#comment-12992661 ]
Scott Carey commented on AVRO-753:
----------------------------------
I need to change BlockingBinaryEncoder as part of this process. It appears I
can simplify it significantly: the new BinaryEncoder and the blocking variant
both need to buffer data in a similar way, so they should be able to share a
lot more code.
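The kind of sharing I have in mind is roughly the following (a minimal sketch;
the class and method names are hypothetical, not a final design):
{code:java}
import java.io.IOException;

// Hypothetical sketch of a buffering base class both encoders could extend.
abstract class BufferingEncoderSketch {
  protected byte[] buf = new byte[2048]; // shared write buffer
  protected int pos = 0;                 // next free position in buf

  /** Ensure at least 'bytes' of free space, flushing to the sink if needed. */
  protected void ensure(int bytes) throws IOException {
    if (buf.length - pos < bytes) {
      flushBuffer();
    }
  }

  /** Subclasses decide where buffered bytes go: straight to the stream for
      the plain encoder, or into block framing for the blocking one. */
  protected abstract void flushBuffer() throws IOException;
}
{code}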
I want to clarify how the blocked encoding should work. The
[spec|http://avro.apache.org/docs/current/spec.html#binary_encoding] doesn't
seem to have the answer.
It says: "If a block's count is negative, its absolute value is used, and the
count is followed immediately by a long block size indicating the number of
bytes in the block." and "The blocked representation permits one to read and
write arrays larger than can be buffered in memory, since one can start writing
items without knowing the full length of the array."
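To make the block framing concrete, here is a rough sketch of how I read the
encoding of a single block (the zig-zag varint helper and the names are mine,
for illustration only):
{code:java}
import java.io.IOException;
import java.io.OutputStream;

class BlockSketch {
  // Zig-zag varint per the spec: zigZag(n) = (n << 1) ^ (n >> 63).
  static void writeLongSketch(OutputStream out, long n) throws IOException {
    long z = (n << 1) ^ (n >> 63);          // zig-zag encode
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits + continuation bit
      z >>>= 7;
    }
    out.write((int) z);
  }

  // Emit one array block whose items are already serialized in a buffer.
  static void writeBlockSketch(OutputStream out, long itemCount,
                               byte[] items, int len) throws IOException {
    writeLongSketch(out, -itemCount); // negative count: a byte size follows
    writeLongSketch(out, len);        // size of this block only
    out.write(items, 0, len);         // the buffered item data
    // ... more blocks may follow; a count of 0 terminates the array.
  }
}
{code}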
If you need to write each block's byte size, how can you start writing items
without knowing the full length of the array? Looking at the code, it mentions
this:
{quote}
"Regular" blocks have a non-zero byte count.
"Overflow" blocks help us deal with the case where a block
contains a value that's too big to buffer. In this case, the
block contains only one item, and we give it an unknown
byte-count. Because these values (1,unknown) are fixed, we're
able to write the header for these overflow blocks to the
underlying stream without seeing the entire block. After writing
this header, we've freed our buffer space to be fully devoted to
blocking the large, inner value.
{quote}
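So, as that comment describes it, the overflow header is written before the
oversized item, something like this (sketch only, reusing writeLongSketch from
the sketch above):
{code:java}
// Write the header for an "overflow" block holding one item that is
// too large to buffer.
static void writeOverflowHeaderSketch(java.io.OutputStream out)
    throws java.io.IOException {
  BlockSketch.writeLongSketch(out, -1); // one item; negative, so a size follows
  BlockSketch.writeLongSketch(out, 0);  // size 0 is how the code marks "unknown"
  // The single large item is then written directly to the stream, applying
  // the same blocking recursively to any nested arrays or maps it contains.
}
{code}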
The spec does not mention that a block can have an 'unknown' byte count. Is
this something that should be added to the spec? Or is it documented somewhere
else that I did not notice?
The code indicates that an 'overflow' block has one item (count = -1) and size
= 0. That seems a little ambiguous: "-1, -1" would make more sense, since a
negative size is impossible, while a valid record can have size zero if it
only contains null fields.
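To make that concrete: with zig-zag encoding, the overflow header is
byte-identical to a legitimate one-item block whose item occupies zero bytes:
{code:java}
// zigZag(n) = (n << 1) ^ (n >> 63), then varint-encoded.
static long zigZag(long n) { return (n << 1) ^ (n >> 63); }
// zigZag(-1) == 1 -> varint byte 0x01; zigZag(0) == 0 -> byte 0x00
// overflow header:       count=-1, size=0  -> bytes 01 00
// real zero-byte block:  count=-1, size=0  -> bytes 01 00 (identical)
// proposed:              count=-1, size=-1 -> bytes 01 01 (unambiguous)
{code}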
I'm refactoring BlockingBinaryEncoder to share code with the new BinaryEncoder
and to be a little simpler. I don't intend to change its behavior, but it
would help to know more about the details of encoding 'too large to buffer'
array values.
> Java: Improve BinaryEncoder Performance
> ----------------------------------------
>
> Key: AVRO-753
> URL: https://issues.apache.org/jira/browse/AVRO-753
> Project: Avro
> Issue Type: Improvement
> Components: java
> Reporter: Scott Carey
> Assignee: Scott Carey
> Fix For: 1.5.0
>
> Attachments: AVRO-753.v1.patch
>
>
> BinaryEncoder has not had a performance improvement pass like BinaryDecoder
> did. It still mostly writes directly to the underlying OutputStream, which is
> not optimal for performance. I like to use a rule of thumb that if you are
> writing to an OutputStream or reading from an InputStream in chunks smaller
> than 128 bytes, you have a performance problem.
> Measurements indicate that optimizing BinaryEncoder yields a 2.5x to 6x
> performance improvement. The process is significantly simpler than for
> BinaryDecoder, both because 'pushing' is easier than 'pulling' and because we
> do not need a 'direct' variant: BinaryEncoder already buffers sometimes.
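To illustrate the 128-byte rule with a toy example (this is not the attached
patch, just the shape of the change):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Toy illustration: bulk up tiny writes before they reach the stream.
class BufferedWriteSketch {
  private final OutputStream out;
  private final byte[] buf = new byte[4096];
  private int pos = 0;

  BufferedWriteSketch(OutputStream out) { this.out = out; }

  // Rather than calling out.write(b) once per byte (a virtual call, and
  // often synchronization, per byte), store into a local array ...
  void writeByte(int b) throws IOException {
    if (pos == buf.length) flushBuffer();
    buf[pos++] = (byte) b;
  }

  // ... and hand the stream one large chunk at a time.
  void flushBuffer() throws IOException {
    out.write(buf, 0, pos);
    pos = 0;
  }

  public static void main(String[] args) throws IOException {
    BufferedWriteSketch w = new BufferedWriteSketch(new ByteArrayOutputStream());
    for (int i = 0; i < 1000; i++) w.writeByte(i);
    w.flushBuffer();
  }
}
{code}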
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira