[
https://issues.apache.org/jira/browse/AVRO-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712299#action_12712299
]
Doug Cutting commented on AVRO-25:
----------------------------------
Finally looking at this. Overall it looks good. Some thoughts:
- can we get a version of the patch that integrates this into Avro, changing
GenericDatumReader/Writer to bracket arrays and maps, so that we can run unit
tests and benchmarks against this? We might then, e.g., change TestSchema and
TestDataFile to test things with the blocked implementations.
- let's not rename ValueReader and ValueWriter here, since that will just make
this patch harder to maintain.
- there are now four copies of the zig-zag varint encoding logic, two of the
float-writing logic, etc. it'd be nice to have just one of each (see the sketch
right after this list); these are innermost loops, and it'd be best to be able
to optimize them in a single place.
- let's not maintain a stack in the basic ValueWriter to validate output -- we
don't elsewhere validate things, especially if it adds cost.
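For reference, the routine in question is small; here is a sketch of it (this
is just the spec's zig-zag-plus-varint encoding, not code lifted from the
patch), which ValueWriter, BlockingValueWriter, and the byte-count helpers
could all share:

    import java.io.IOException;
    import java.io.OutputStream;

    class Varint {
      /** Zig-zag encode a long, then write it as a little-endian base-128 varint. */
      static void writeLong(long n, OutputStream out) throws IOException {
        n = (n << 1) ^ (n >> 63);                // zig-zag: sign moves to the low bit
        while ((n & ~0x7FL) != 0) {              // more than 7 significant bits left
          out.write((int) ((n & 0x7F) | 0x80));  // low 7 bits, continuation bit set
          n >>>= 7;
        }
        out.write((int) n);                      // final byte, continuation bit clear
      }
    }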
Here are some ideas for reducing code duplication: if BlockingValueWriter
extends ValueWriter then it can override each writeFoo method with something
like 'super.writeFoo(foo); check();', getting rid of one duplication. it would
also override write(byte) to add things to the buffer. The duplication in
writeLongDirect can be avoided by wrapping BlockingValueWriter's output stream
in a ValueWriter. Similarly, the duplication in encodeLong can be removed by
using a ValueWriter that writes to a ByteArrayOutputStream. We can reuse the
same ValueWriter/ByteArrayOutputStream pair, in the style of Hadoop's
DataOutputBuffer.
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/DataOutputBuffer.java
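Putting those pieces together, a rough sketch of the override pattern
(assuming ValueWriter wraps an OutputStream and has per-type methods like
writeLong, which may not match the actual classes; the rest is illustrative,
not code from the patch):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    class BlockingValueWriter extends ValueWriter {
      // Reused scratch pair, in the style of DataOutputBuffer: encodeLong can
      // delegate to an ordinary ValueWriter over a resettable buffer instead
      // of carrying its own copy of the varint logic.
      private final ByteArrayOutputStream scratch = new ByteArrayOutputStream();
      private final ValueWriter scratchWriter = new ValueWriter(scratch);

      public BlockingValueWriter(OutputStream out) { super(out); }

      // Each writeFoo becomes "super.writeFoo(foo); check();", so the encoding
      // itself stays in ValueWriter.
      @Override
      public void writeLong(long n) throws IOException {
        super.writeLong(n);
        check();
      }

      // Compute the byte form of a long (e.g. for block sizes) by reusing the
      // scratch pair rather than duplicating the varint code.
      private byte[] encodeLong(long n) throws IOException {
        scratch.reset();
        scratchWriter.writeLong(n);
        return scratch.toByteArray();
      }

      // Blocking bookkeeping (deciding when to start/flush a block); elided.
      private void check() throws IOException { /* ... */ }
    }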
> Blocking for value output (with API change)
> -------------------------------------------
>
> Key: AVRO-25
> URL: https://issues.apache.org/jira/browse/AVRO-25
> Project: Avro
> Issue Type: Improvement
> Components: java
> Reporter: Raymie Stata
> Assignee: Thiruvalluvan M. G.
> Attachments: AVRO-25.patch
>
>
> The Avro specification has provisions for decomposing very large arrays and
> maps into "blocks." These provisions allow streaming implementations that
> can, for example, write the contents of a file out as an Avro array w/out
> knowing in advance how many records are in the file.
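> To make the block encoding concrete, here is a rough sketch of streaming an
> array of strings without knowing its length up front (ValueOutput here is a
> minimal stand-in invented for this example, not the API from the patch):
>
>     import java.io.IOException;
>     import java.util.ArrayList;
>     import java.util.Iterator;
>     import java.util.List;
>
>     // Minimal stand-in: just the two primitives the sketch needs.
>     interface ValueOutput {
>       void writeLong(long n) throws IOException;     // zig-zag varint
>       void writeString(String s) throws IOException; // length-prefixed UTF-8
>     }
>
>     class StreamedArrayExample {
>       // Writes the records as one Avro array, flushing a block of at most
>       // 1000 items at a time, so the total count is never needed up front.
>       static void writeArray(Iterator<String> records, ValueOutput out)
>           throws IOException {
>         List<String> block = new ArrayList<String>();
>         while (records.hasNext()) {
>           block.add(records.next());
>           if (block.size() == 1000 || !records.hasNext()) {
>             out.writeLong(block.size());   // block count
>             for (String s : block)
>               out.writeString(s);          // the block's items
>             block.clear();
>           }
>         }
>         out.writeLong(0);                  // a zero count terminates the array
>       }
>     }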
> The current Java implementation of Avro does not support this provision. My
> colleague Thiru will be attaching a patch which implements blocking. It
> turns out that the buffering required to do blocking is non-trivial, so it
> seems beneficial to include a standard implementation of blocking as part of
> the reference Avro implementation.
> This is an early version of the code. We are still working on testing and
> performance tuning. But we wanted early feedback.
> This patch also includes a new set of classes called ValueInput and
> ValueOutput, which are meant to replace ValueReader and ValueWriter. These
> classes have largely the same API as ValueReader/Writer, but they include a
> few more methods to "bracket" items that appear inside of arrays and maps.
> Shortly, we'll be posting a separate patch which implements further
> subclasses of ValueInput/Output that do "validation" of input and output
> against a schema (and also do automatic schema resolution for readers).
> We're implementing these classes separately from ValueInput/Output to allow you
> to kick our tires w/out causing too much disruption to your source trees.
> Let's validate the basic idea behind these patches first, and then determine
> the details of integrating them into the rest of Avro.