[ https://issues.apache.org/jira/browse/AVRO-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992661#comment-12992661 ]
Scott Carey commented on AVRO-753:
----------------------------------
I need to change BlockingBinaryEncoder as part of this process. It appears I
can simplify it significantly: the new BinaryEncoder and the blocking variant
both need to buffer data in a similar way, so they should be able to share a
lot more code.
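The kind of sharing I have in mind is roughly the following (a minimal sketch;
the class and method names are hypothetical, not a final design):
{code:java}
import java.io.IOException;

// Hypothetical sketch of a buffering base class both encoders could extend.
abstract class BufferingEncoderSketch {
  protected byte[] buf = new byte[2048]; // shared write buffer
  protected int pos = 0;                 // next free position in buf

  /** Ensure at least 'bytes' of free space, flushing to the sink if needed. */
  protected void ensure(int bytes) throws IOException {
    if (buf.length - pos < bytes) {
      flushBuffer();
    }
  }

  /** Subclasses decide where buffered bytes go: straight to the stream for
      the plain encoder, or into block framing for the blocking one. */
  protected abstract void flushBuffer() throws IOException;
}
{code}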
I want to clarify how the blocked encoding should work. The
[spec|http://avro.apache.org/docs/current/spec.html#binary_encoding] doesn't
seem to have the answer.
It says: "If a block's count is negative, its absolute value is used, and the
count is followed immediately by a long block size indicating the number of
bytes in the block." and "The blocked representation permits one to read and
write arrays larger than can be buffered in memory, since one can start writing
items without knowing the full length of the array."
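To make the block framing concrete, here is a rough sketch of how I read the
encoding of a single block (the zig-zag varint helper and the names are mine,
for illustration only):
{code:java}
import java.io.IOException;
import java.io.OutputStream;

class BlockSketch {
  // Zig-zag varint per the spec: zigZag(n) = (n << 1) ^ (n >> 63).
  static void writeLongSketch(OutputStream out, long n) throws IOException {
    long z = (n << 1) ^ (n >> 63);          // zig-zag encode
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits + continuation bit
      z >>>= 7;
    }
    out.write((int) z);
  }

  // Emit one array block whose items are already serialized in a buffer.
  static void writeBlockSketch(OutputStream out, long itemCount,
                               byte[] items, int len) throws IOException {
    writeLongSketch(out, -itemCount); // negative count: a byte size follows
    writeLongSketch(out, len);        // size of this block only
    out.write(items, 0, len);         // the buffered item data
    // ... more blocks may follow; a count of 0 terminates the array.
  }
}
{code}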
If you need to write each block's byte size, how can you start writing items
without knowing the full length of the array? Looking at the code, it mentions
this:
{quote}
"Regular" blocks have a non-zero byte count.
"Overflow" blocks help us deal with the case where a block
contains a value that's too big to buffer. In this case, the
block contains only one item, and we give it an unknown
byte-count. Because these values (1,unknown) are fixed, we're
able to write the header for these overflow blocks to the
underlying stream without seeing the entire block. After writing
this header, we've freed our buffer space to be fully devoted to
blocking the large, inner value.
{quote}
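So, as that comment describes it, the overflow header is written before the
oversized item, something like this (sketch only, reusing writeLongSketch from
the sketch above):
{code:java}
// Write the header for an "overflow" block holding one item that is
// too large to buffer.
static void writeOverflowHeaderSketch(java.io.OutputStream out)
    throws java.io.IOException {
  BlockSketch.writeLongSketch(out, -1); // one item; negative, so a size follows
  BlockSketch.writeLongSketch(out, 0);  // size 0 is how the code marks "unknown"
  // The single large item is then written directly to the stream, applying
  // the same blocking recursively to any nested arrays or maps it contains.
}
{code}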
The spec does not mention that a block can have an 'unknown' byte count. Is
this something that should be added to the spec? Or is it documented somewhere
else that I did not notice?
The code indicates that an 'overflow' block has one item (count = -1) and size
= 0. That seems a little ambiguous: "-1, -1" would make more sense, since a
negative size is impossible, while a valid record can have size zero if it
only contains null fields.
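To make that concrete: with zig-zag encoding, the overflow header is
byte-identical to a legitimate one-item block whose item occupies zero bytes:
{code:java}
// zigZag(n) = (n << 1) ^ (n >> 63), then varint-encoded.
static long zigZag(long n) { return (n << 1) ^ (n >> 63); }
// zigZag(-1) == 1 -> varint byte 0x01; zigZag(0) == 0 -> byte 0x00
// overflow header:       count=-1, size=0  -> bytes 01 00
// real zero-byte block:  count=-1, size=0  -> bytes 01 00 (identical)
// proposed:              count=-1, size=-1 -> bytes 01 01 (unambiguous)
{code}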
I'm refactoring BlockingBinaryEncoder to share code with the new BinaryEncoder
and to be a little simpler. I don't intend to change its behavior, but it
would help to know more about the details of encoding 'too large to buffer'
array values.
> Java: Improve BinaryEncoder Performance
> ----------------------------------------
>
> Key: AVRO-753
> URL: https://issues.apache.org/jira/browse/AVRO-753
> Project: Avro
> Issue Type: Improvement
> Components: java
> Reporter: Scott Carey
> Assignee: Scott Carey
> Fix For: 1.5.0
>
> Attachments: AVRO-753.v1.patch
>
>
> BinaryEncoder has not had a performance improvement pass like BinaryDecoder
> did. It still mostly writes directly to the underlying OutputStream, which is
> not optimal for performance. I like to use a rule of thumb that if you are
> writing to an OutputStream or reading from an InputStream in chunks smaller
> than 128 bytes, you have a performance problem.
> Measurements indicate that optimizing BinaryEncoder yields a 2.5x to 6x
> performance improvement. The process is significantly simpler than for
> BinaryDecoder, both because 'pushing' is easier than 'pulling' and because we
> do not need a 'direct' variant: BinaryEncoder already buffers sometimes.
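To illustrate the 128-byte rule with a toy example (this is not the attached
patch, just the shape of the change):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Toy illustration: bulk up tiny writes before they reach the stream.
class BufferedWriteSketch {
  private final OutputStream out;
  private final byte[] buf = new byte[4096];
  private int pos = 0;

  BufferedWriteSketch(OutputStream out) { this.out = out; }

  // Rather than calling out.write(b) once per byte (a virtual call, and
  // often synchronization, per byte), store into a local array ...
  void writeByte(int b) throws IOException {
    if (pos == buf.length) flushBuffer();
    buf[pos++] = (byte) b;
  }

  // ... and hand the stream one large chunk at a time.
  void flushBuffer() throws IOException {
    out.write(buf, 0, pos);
    pos = 0;
  }

  public static void main(String[] args) throws IOException {
    BufferedWriteSketch w = new BufferedWriteSketch(new ByteArrayOutputStream());
    for (int i = 0; i < 1000; i++) w.writeByte(i);
    w.flushBuffer();
  }
}
{code}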
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira