[ 
https://issues.apache.org/jira/browse/AVRO-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712299#action_12712299
 ] 

Doug Cutting commented on AVRO-25:
----------------------------------

Finally looking at this.  Overall it looks good.  Some thoughts:
 - can we get a version of the patch that integrates this into Avro, changing 
GenericDatumReader/Writer to bracket arrays and maps, so that we can run unit 
tests and benchmarks against this?  We might then, e.g., change TestSchema and 
TestDataFile to test things with the blocked implementations.
 - let's not rename ValueReader and ValueWriter here, since that will just make 
this patch harder to maintain.
 - there are now four copies of the zig-zag varint encoding logic, two of the 
float-writing logic, etc.  It'd be nice to just have one of each.  These are 
innermost loops, and it'd be best to be able to optimize them in a single place.
 - let's not maintain a stack in the basic ValueWriter to validate output -- we 
don't elsewhere validate things, especially if it adds cost.
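
For reference, the zig-zag varint logic mentioned above is small enough to 
keep in a single shared helper.  What follows is only a minimal sketch, not 
the code from the patch: the class name ZigZagVarint and its method names are 
made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Avro: one shared copy of the encoding loop.
final class ZigZagVarint {
    private ZigZagVarint() {}

    /** Writes n to out as a zig-zag encoded, variable-length integer. */
    static void writeLong(long n, OutputStream out) throws IOException {
        // Zig-zag: map small negative and positive values to small unsigned ones.
        long z = (n << 1) ^ (n >> 63);
        // Varint: emit 7 bits per byte, low bits first; high bit means "more follows".
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Convenience wrapper that returns the encoded bytes. */
    static byte[] encode(long n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            writeLong(n, bytes);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Every writer that needs to emit a long could then call 
ZigZagVarint.writeLong rather than carry its own copy of the loop.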

Here are some ideas for reducing code duplication: if BlockingValueWriter 
extends ValueWriter then it can override each writeFoo method with something 
like 'super.writeFoo(foo); check();', getting rid of one duplication.  It would 
also override write(byte) to add things to the buffer.  The duplication in 
writeLongDirect can be avoided by wrapping BlockingValueWriter's output stream 
in a ValueWriter.  Similarly, the duplication in encodeLong can be removed by 
using a ValueWriter that writes to a ByteArrayOutputStream.  We can reuse the 
same ValueWriter/ByteArrayOutputStream pair, in the style of Hadoop's 
DataOutputBuffer.

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/DataOutputBuffer.java
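
The subclassing idea above might look roughly like the following.  This is a 
sketch under stated assumptions, not the patch's code: SketchValueWriter and 
SketchBlockingValueWriter are simplified stand-ins for ValueWriter and 
BlockingValueWriter, only writeLong is shown, each writeLong call is treated 
as one array item, and the blockSize threshold, flushBlock and endArray are 
hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Simplified stand-in for ValueWriter: the only copy of the encoding loop.
class SketchValueWriter {
    protected final OutputStream out;

    SketchValueWriter(OutputStream out) { this.out = out; }

    void writeLong(long n) throws IOException {
        long z = (n << 1) ^ (n >> 63);           // zig-zag
        while ((z & ~0x7FL) != 0) {              // varint, 7 bits per byte
            write((byte) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        write((byte) z);
    }

    // All encoded bytes funnel through here, so subclasses can redirect them.
    protected void write(byte b) throws IOException { out.write(b); }
}

// Simplified stand-in for BlockingValueWriter (assumption: every value
// written is one item of a single array of longs).
class SketchBlockingValueWriter extends SketchValueWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final SketchValueWriter direct;      // unbuffered writer for block counts
    private final int blockSize;                 // hypothetical flush threshold, in bytes
    private long itemsInBlock;

    SketchBlockingValueWriter(OutputStream out, int blockSize) {
        super(out);
        this.direct = new SketchValueWriter(out);
        this.blockSize = blockSize;
    }

    @Override
    void writeLong(long n) throws IOException {
        super.writeLong(n);                      // reuse the base encoding
        itemsInBlock++;
        check();
    }

    @Override
    protected void write(byte b) { buffer.write(b); }   // buffer instead of emitting

    private void check() throws IOException {
        if (buffer.size() >= blockSize) flushBlock();
    }

    // Emits "item count, then items" for the buffered block.
    void flushBlock() throws IOException {
        if (itemsInBlock == 0) return;
        direct.writeLong(itemsInBlock);
        buffer.writeTo(out);
        buffer.reset();
        itemsInBlock = 0;
    }

    // A block with a count of zero terminates the array.
    void endArray() throws IOException {
        flushBlock();
        direct.writeLong(0);
    }
}
```

Here the encoding loop lives only in the base class.  The subclass redirects 
write(byte) into a reusable buffer and writes block counts through a second, 
unbuffered writer wrapped around the same output stream.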



> Blocking for value output (with API change)
> -------------------------------------------
>
>                 Key: AVRO-25
>                 URL: https://issues.apache.org/jira/browse/AVRO-25
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Raymie Stata
>            Assignee: Thiruvalluvan M. G.
>         Attachments: AVRO-25.patch
>
>
> The Avro specification has provisions for decomposing very large arrays and 
> maps into "blocks."  These provisions allow for streaming implementations 
> that would allow one to, for example, write the contents of a file out as an 
> Avro array w/out knowing in advance how many records are in the file.
> The current Java implementation of Avro does not support this provision.  My 
> colleague Thiru will be attaching a patch which implements blocking.  It 
> turns out that the buffering required to do blocking is non-trivial, so it 
> seems beneficial to include a standard implementation of blocking as part of 
> the reference Avro implementation.
> This is an early version of the code.  We are still working on testing and 
> performance tuning.  But we wanted early feedback.
> This patch also includes a new set of classes called ValueInput and 
> ValueOutput, which are meant to replace ValueReader and ValueWriter.  These 
> classes have largely the same API as ValueReader/Writer, but they include a 
> few more methods to "bracket" items that appear inside of arrays and maps.  
> Shortly, we'll be posting a separate patch which implements further 
> subclasses of ValueInput/Output that do "validation" of input and output 
> against a schema (and also do automatic schema resolution for readers).
> We're implementing these classes separate from ValueInput/Output to allow you 
> to kick our tires w/out causing too much disruption to your source trees.  
> Let's validate the basic idea behind these patches first, and then determine 
> the details of integrating them into the rest of Avro.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
