[jira] [Commented] (AVRO-859) Java: Data Flow Overhaul -- Composition and Symmetry

Scott Carey (JIRA) Thu, 21 Jul 2011 11:19:25 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069114#comment-13069114
 ]


Scott Carey commented on AVRO-859:
----------------------------------

I really need to look more deeply at the changes you have made in the C 
implementation.  I started my work thinking in terms of "push" and "pull" and 
was doing something that sounds similar to your description of the "consumer" 
from AVRO-762.  The result was an implementation of writing that was much 
faster than the current implementation and based on functional composition -- 
it composed functions with signatures like:
{code}
void send(FROM f, TO t);
{code}
Implementing readers turned out to be more difficult and there was much code 
duplication and no symmetry.  Requests for features on the user mailing list 
included things like converting Specific objects to Generic ones, and that got 
me thinking about splitting read/write up into symmetric components.  This 
isn't very easy, especially for Maps and Arrays.  Records and Unions aren't so 
tough; they turn into 'composite' and 'branch' flows fairly easily.

For "push" versus "pull" I have come to the realization that you can mix the 
two if you define the boundary very carefully and use the "flow" function, 
which is a mix of both.
"Push" in general is easier, but at the lowest level you must pull and then 
invert that into a push.  The Access functor has a method, "thenFlow", that 
changes a pull into a push.
{code}
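abstract class Access<FROM, T> {
  // Pull: extract a T from the source value.
  abstract T access(FROM f);
  // Invert the pull into a push by composing with a downstream Flow.
  abstract <NEXT> Flow<FROM, NEXT> thenFlow(Flow<T, NEXT> then);
}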
{code}
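
A minimal sketch of how the pull-to-push inversion could be wired up, assuming Flow is an interface wrapping the send(FROM, TO) signature shown earlier. The Flow interface body, the Access.of helper, and the User class are illustrative assumptions, not taken from the actual patch:

```java
import java.util.function.Function;

/** Push side (assumed shape): move a value of type FROM into a target of type TO. */
interface Flow<FROM, TO> {
  void send(FROM from, TO to);
}

/** Pull side: extract a T from a FROM, and invert that pull into a push. */
abstract class Access<FROM, T> {
  abstract T access(FROM from);

  /** Pull T out of FROM, then push it onward through 'then'. */
  <NEXT> Flow<FROM, NEXT> thenFlow(Flow<T, NEXT> then) {
    return (from, next) -> then.send(access(from), next);
  }

  /** Hypothetical helper: wrap a plain getter as an Access. */
  static <F, T> Access<F, T> of(Function<F, T> getter) {
    return new Access<F, T>() {
      @Override T access(F from) { return getter.apply(from); }
    };
  }
}

/** Illustrative source type; not part of Avro. */
final class User {
  final String name;
  User(String name) { this.name = name; }
}

final class FlowSketch {
  public static void main(String[] args) {
    // Target flow: push a String into a StringBuilder (stand-in for an Encoder).
    Flow<String, StringBuilder> writeString = (s, sb) -> sb.append(s);
    // Source access composed with the target flow; the common type T is String.
    Flow<User, StringBuilder> writeName =
        Access.of((User u) -> u.name).thenFlow(writeString);

    StringBuilder out = new StringBuilder();
    writeName.send(new User("avro"), out);
    System.out.println(out); // prints "avro"
  }
}
```

The composed flow is built once and can then be reused for every value, which is the property the functional-composition approach relies on.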

And this is where "source" and "target" meet in most cases -- the FlowFactory 
takes the Source Access functor, and creates a composite flow from the target 
flow functor -- the two match because the common type is T, which is determined 
by the schema node.

For everything but Arrays/Maps I have a working implementation that I'm hoping 
to submit here soon, but Arrays/Maps (especially Maps) have turned out to be 
trickier.  I will make them work in a slightly less elegant way (Source 
implementations will have to loop over their type and trigger callbacks in a 
special target callback, rather than composing functors).
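
A rough sketch of what the callback-based fallback for arrays might look like. The ArrayCallback interface and its method names are hypothetical, chosen only to illustrate a source looping over its own collection and driving a target:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical callback a target would expose for array contents. */
interface ArrayCallback<ITEM> {
  void start(int size);
  void item(ITEM item);
  void end();
}

final class ArrayLoopSketch {
  /** Source side: loop over its own collection type and trigger the target's callbacks. */
  static <ITEM> void sendArray(List<ITEM> source, ArrayCallback<ITEM> target) {
    target.start(source.size());
    for (ITEM item : source) {
      target.item(item);
    }
    target.end();
  }

  /** Target side: an illustrative callback that records what it receives. */
  static ArrayCallback<String> recording(List<String> log) {
    return new ArrayCallback<String>() {
      @Override public void start(int size) { log.add("start:" + size); }
      @Override public void item(String item) { log.add(item); }
      @Override public void end() { log.add("end"); }
    };
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    sendArray(Arrays.asList("a", "b"), recording(log));
    System.out.println(log); // prints [start:2, a, b, end]
  }
}
```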

I wish I could use Scala here... about every 6 lines of Java would reduce to 1 
in Scala for the function type definitions.

For schema resolution, everything but record re-ordering will be easy -- 
requiring simply one more functor or replacing a functor to transform a type, 
with the addition of "skip" functors for the source.
For record re-ordering I will need a tag type that specifies whether a source 
or target requires field order or not.  If either side is 'unordered' then it 
is simple. If both sides require order, a buffer will be required.  This buffer 
can be generic so that no source or target implementations have to worry about 
it other than declaring whether they require order, but it is non-trivial. 
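
The ordering rule above could be sketched as follows. The FieldOrder tag, the needsBuffer check, and the reorder helper are hypothetical names, and the sketch assumes both sides have the same fields, with sourceToTarget being a permutation of their positions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical tag: does a source or target require fields in its own declared order? */
enum FieldOrder { ORDERED, UNORDERED }

final class ReorderSketch {
  /** A buffer is needed only when both sides insist on their own field order. */
  static boolean needsBuffer(FieldOrder source, FieldOrder target) {
    return source == FieldOrder.ORDERED && target == FieldOrder.ORDERED;
  }

  /**
   * Generic buffering step: values arrive in source field order, and
   * sourceToTarget[i] is the target position of source field i.
   * Returns the values rearranged into target field order.
   */
  static List<Object> reorder(List<Object> inSourceOrder, int[] sourceToTarget) {
    Object[] slots = new Object[sourceToTarget.length];
    for (int i = 0; i < sourceToTarget.length; i++) {
      slots[sourceToTarget[i]] = inSourceOrder.get(i);
    }
    return new ArrayList<>(Arrays.asList(slots));
  }
}
```

Because the buffer only deals with positions and opaque values, it can sit between any ordered source and ordered target without either implementation knowing about it.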

Likewise, default values will need some work. In the Java implementation they 
are handled by storing a Jackson JSON node with the default.  This is not 
ideal.  It would be better to convert default values to the most efficient 
representation a Target needs, so that it can insert them when the source does 
not have the value.

> Java: Data Flow Overhaul -- Composition and Symmetry
> ----------------------------------------------------
>
>                 Key: AVRO-859
>                 URL: https://issues.apache.org/jira/browse/AVRO-859
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>
> Data flow in Avro is currently broken into two parts:  Read and Write.  These 
> share many common patterns but almost no common code.  
> Additionally, the APIs for this are DatumReader and DatumWriter, which 
> requires that implementations know how to traverse Schemas and use the 
> Resolver.
> This is a proposal to overhaul the inner workings of Avro Java between the 
> Decoder/Encoder APIs and DatumReader/DatumWriter such that there is 
> significantly more code re-use and much greater opportunity for new features 
> that can all share in general optimizations and dynamic code generation.
> The two primary concepts involved are:
> * _*Functional Composition*_
> * _*Symmetry*_
> h4. Functional Composition
> All read and write operations can be broken into functional bits and composed 
> rather than writing monolithic classes.  This allows a "DatumWriter2" to be a 
> graph of functions that pre-compute all state required from a schema rather 
> than traverse a schema for each write.
> h4. Symmetry
> Avro's data flow can be made symmetric.  Rather than thinking in terms of 
> Read and Write, think in terms of:
> * _*Source*_: Where data that is represented by an Avro schema comes from -- 
> this may be a Decoder, or an Object graph.
> * _*Target*_: Where data that represents an Avro schema is sent -- this may 
> be an Encoder or an Object graph.
> (More detail in the comments)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
