[
https://issues.apache.org/jira/browse/AVRO-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065402#comment-13065402
]
Scott Carey commented on AVRO-859:
----------------------------------
h4. Functional Composition
All read and write operations can be broken into functional bits and composed
rather than writing monolithic classes. This allows a "DatumWriter2" to be a
graph of functions that pre-compute all state required from a schema rather
than traverse a schema for each write. Additionally, if the functions are all
of a common set of types, it becomes easy to use code generation: either
directly or by parsing the resulting function graph and converting to code that
the JVM can better optimize.
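As a hedged sketch of the idea (hypothetical names, plain Java stand-ins for Avro's real Encoder and record types): the per-field steps are computed once from a schema, and every subsequent write just runs the precomputed graph.

```java
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: the per-field write steps are computed once from a
// schema, then reused for every datum -- no schema traversal at write time.
class PrecomputedWriter {
    // Each step writes one field of a datum (an Object[] record here) to a
    // StringBuilder standing in for an Encoder.
    private final List<BiConsumer<Object[], StringBuilder>> steps;

    PrecomputedWriter(List<BiConsumer<Object[], StringBuilder>> steps) {
        this.steps = steps;
    }

    void write(Object[] datum, StringBuilder out) {
        for (BiConsumer<Object[], StringBuilder> step : steps) {
            step.accept(datum, out);
        }
    }
}
```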
h4. Symmetry
Avro's data flow can be made symmetric. Rather than thinking in terms of Read
and Write, think in terms of:
* _*Source*_: Where data that is represented by an Avro schema comes from --
this may be a Decoder, or an Object graph.
* _*Target*_: Where data that represents an Avro schema is sent -- this may be
an Encoder or an Object graph.
Combine the two ideas and you can create _*Flows*_ -- the combination
of a Source and a Target for a specific Schema (or resolvable Schema pair).
The machinery that traverses and resolves schemas -- the "DatumReader" logic --
can be written once, with different sources and targets combined to make
different tools:
* A Decoder source + GenericData target = GenericDatumReader
* A SpecificData source + Encoder target = SpecificDatumWriter
* A BinaryDecoder source + JsonEncoder target = a transform from binary to JSON
without any intermediate objects!
* A SpecificData source + GenericData target = a transform from one object type
to another
Add in new sources and targets (Pig, ProtoBuf, Thrift objects; Pig binary,
Protobuf binary, Thrift binary) and you can mix/match more transformation tasks.
Additionally, one can write a generic Equals/Compare implementation that takes
two Sources and compares them or checks them for equality. Then you can compare
binary data with an object, or two objects with each other.
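A hedged sketch of that idea in plain Java (hypothetical names throughout; `java.util.function.Function` stands in for the access functor): each Source supplies its own extractor for an int field, and the comparison meets the two sides only at the extracted value.

```java
import java.util.function.Function;

// Hypothetical sketch: generic equality over two Sources. Each side supplies
// its own extractor for the integer field; neither knows the other's context.
class SourceEquals {
    static <A, B> boolean intFieldEquals(Function<A, Integer> left, A a,
                                         Function<B, Integer> right, B b) {
        return left.apply(a).equals(right.apply(b));
    }
}
```

Here an `Object[]` plays the role of a record and an `int[]` plays the role of decoded binary; the equality code itself is indifferent to both.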
Data flow could also tee: one source with many targets.
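A tee might be sketched like this (hypothetical types, plain Java): the source value is read once and handed to each target in turn.

```java
import java.util.List;

// Hypothetical sketch of a tee: one source value fanned out to many targets.
class Tee {
    interface IntTarget { void put(int value); }

    static void tee(int sourceValue, List<IntTarget> targets) {
        for (IntTarget t : targets) {
            t.put(sourceValue);   // each target receives the same source value
        }
    }
}
```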
h4. Functional units
After much prototyping and design, I have identified that all Avro data flow
can be done by the composition of two functors:
The Unary Functor, which I have named *Access*:
{code}
interface Access<A, B> {
  B access(A a);
}
{code}
And a Binary Functor with two types named *Flow*:
{code}
interface Flow<A, B> {
  B flow(A a, B b);
}
{code}
In most cases, you can replace "A" with "FROM" and "B" with "TO" in relation to
the Source and Target concepts. These functions compose naturally in all the
ways required for data to flow from a source to a target.
h4. Making Symmetry
Consider this simple example, a Flow over the schema:
{code}
{"type": "record", "name": "Foo", "fields":
  [{"name": "x", "type": "int"}]
}
{code}
In the current implementation, a GenericDatumReader has the following API:
{code}
D read(D reuse, Decoder in);
{code}
which internally parses a Schema step by step, recursively calling methods with
a similar signature.
When we get to the leaf field, we return an integer, and on return insert that
into a GenericData.Record as the first field.
A very similar process occurs with GenericDatumWriter:
{code}
void write(D datum, Encoder out);
{code}
Which traverses a schema, recursively calling methods with a similar signature.
On the way down the schema graph, we access objects and pass portions of the
data through, and when we hit the leaf field, we write it to the encoder and
return.
Consider the innermost operation for both of the above:
Fetch an integer, then put it somewhere:
|| step || Source || Target || Source op || Target op || flow signature ||
| read an integer | IndexedRecord | Encoder | IndexedRecord.get() | (null) | int access(IndexedRecord) |
| read an integer | Decoder | IndexedRecord | Decoder.readInt() | (null) | int access(Decoder) |
| send integer to output | IndexedRecord | Encoder | (null) | Encoder.writeInt() | Encoder flow(int, Encoder) |
| send integer to output | Decoder | IndexedRecord | (null) | IndexedRecord.put() | IndexedRecord flow(int, IndexedRecord) |
The access and flow signatures compose as follows:
{code}
int access(A);
FollowedBy
B flow(int, B);
Equals:
B flow(A, B);
{code}
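This composition rule can be sketched directly in Java (hypothetical helper name, Java 8 lambdas assumed): an Access<A,Integer> followed by a Flow<Integer,B> yields a Flow<A,B>.

```java
// The two functors from above, plus a hypothetical composition helper.
interface Access<A, B> { B access(A a); }
interface Flow<A, B> { B flow(A a, B b); }

class Compose {
    static <A, B> Flow<A, B> intStep(Access<A, Integer> get, Flow<Integer, B> put) {
        // flow(a, b): extract the int from the source context, hand it to the target.
        return (a, b) -> put.flow(get.access(a), b);
    }
}
```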
So the above two examples compose to:
|| step || Source || Target || Source op || Target op || flow signature ||
| int flow | IndexedRecord | Encoder | IndexedRecord.get() | Encoder.writeInt() | Encoder flow(IndexedRecord, Encoder) |
| int flow | Decoder | IndexedRecord | Decoder.readInt() | IndexedRecord.put() | IndexedRecord flow(Decoder, IndexedRecord) |
As can be seen, for an integer field one can compose two functions, one
provided by the Source and one provided by the Target, and produce a Flow of
data between them.
The source and target each have their own contexts -- the object types that an
integer field represents -- but do not have to know anything about the other
side. The flow composition also does not need any information about the source
or target -- they meet only at "int".
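For the Foo record above, both directions can then be built from the same pieces. A hedged sketch with stand-ins (an Object[] for IndexedRecord, an int queue for the Decoder, a StringBuilder for the Encoder; all names hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: the write flow and the read flow for Foo's int field,
// each composed from a Source-supplied access and a Target-supplied flow.
class FooFlows {
    // Write direction: IndexedRecord source (Object[]) -> Encoder target (StringBuilder).
    static StringBuilder writeInt(Object[] record, StringBuilder encoder) {
        int value = (Integer) record[0];       // source op: IndexedRecord.get()
        return encoder.append(value);          // target op: Encoder.writeInt()
    }

    // Read direction: Decoder source (Deque<Integer>) -> IndexedRecord target (Object[]).
    static Object[] readInt(Deque<Integer> decoder, Object[] record) {
        int value = decoder.removeFirst();     // source op: Decoder.readInt()
        record[0] = value;                     // target op: IndexedRecord.put()
        return record;
    }
}
```

Note the symmetry: the two methods have mirrored flow signatures, and swapping endpoints (say, a StringBuilder "JsonEncoder" target on the read side) would change only the target op.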
> Java: Data Flow Overhaul -- Composition and Symmetry
> ----------------------------------------------------
>
> Key: AVRO-859
> URL: https://issues.apache.org/jira/browse/AVRO-859
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Scott Carey
> Assignee: Scott Carey
>
> Data flow in Avro is currently broken into two parts: Read and Write. These
> share many common patterns but almost no common code.
> Additionally, the APIs for this are DatumReader and DatumWriter, which
> requires that implementations know how to traverse Schemas and use the
> Resolver.
> This is a proposal to overhaul the inner workings of Avro Java between the
> Decoder/Encoder APIs and DatumReader/DatumWriter such that there is
> significantly more code re-use and much greater opportunity for new features
> that can all share in general optimizations and dynamic code generation.
> The two primary concepts involved are:
> * _*Functional Composition*_
> * _*Symmetry*_
> h4. Functional Composition
> All read and write operations can be broken into functional bits and composed
> rather than writing monolithic classes. This allows a "DatumWriter2" to be a
> graph of functions that pre-compute all state required from a schema rather
> than traverse a schema for each write.
> h4. Symmetry
> Avro's data flow can be made symmetric. Rather than thinking in terms of
> Read and Write, think in terms of:
> * _*Source*_: Where data that is represented by an Avro schema comes from --
> this may be a Decoder, or an Object graph.
> * _*Target*_: Where data that represents an Avro schema is sent -- this may
> be an Encoder or an Object graph.
> (More detail in the comments)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira