[
https://issues.apache.org/jira/browse/AVRO-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065402#comment-13065402
]
Scott Carey commented on AVRO-859:
----------------------------------
h4. Functional Composition
All read and write operations can be broken into functional bits and composed
rather than writing monolithic classes. This allows a "DatumWriter2" to be a
graph of functions that pre-compute all state required from a schema rather
than traverse a schema for each write. Additionally, if the functions are all
of a common set of types, it becomes easy to use code generation: either
directly or by parsing the resulting function graph and converting to code that
the JVM can better optimize.
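As a hedged sketch of the idea (hypothetical names, plain Java stand-ins for Avro's real Encoder and record types): the per-field steps are computed once from a schema, and every subsequent write just runs the precomputed graph.

```java
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: the per-field write steps are computed once from a
// schema, then reused for every datum -- no schema traversal at write time.
class PrecomputedWriter {
    // Each step writes one field of a datum (an Object[] record here) to a
    // StringBuilder standing in for an Encoder.
    private final List<BiConsumer<Object[], StringBuilder>> steps;

    PrecomputedWriter(List<BiConsumer<Object[], StringBuilder>> steps) {
        this.steps = steps;
    }

    void write(Object[] datum, StringBuilder out) {
        for (BiConsumer<Object[], StringBuilder> step : steps) {
            step.accept(datum, out);
        }
    }
}
```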
h4. Symmetry
Avro's data flow can be made symmetric. Rather than thinking in terms of Read
and Write, think in terms of:
* _*Source*_: Where data that is represented by an Avro schema comes from --
this may be a Decoder, or an Object graph.
* _*Target*_: Where data that represents an Avro schema is sent -- this may be
an Encoder or an Object graph.
Combine the two ideas and you can create _*Flows*_ -- the combination
of a Source and a Target for a specific Schema (or resolvable Schema pair).
The machinery that traverses and resolves schemas -- the "DatumReader" logic --
can be written once, with different sources and targets combined to make
different tools:
* A Decoder source + GenericData target = GenericDatumReader
* A SpecificData source + Encoder target = SpecificDatumWriter
* A BinaryDecoder source + JsonEncoder target = a transform from binary to JSON
without any intermediate objects!
* A SpecificData source + GenericData target = a transform from one object type
to another
Add in new sources and targets (Pig, ProtoBuf, Thrift objects; Pig binary,
Protobuf binary, Thrift binary) and you can mix/match more transformation tasks.
Additionally, one can write a generic Equals/Compare implementation that takes
two Sources and compares them or checks them for equality. Then you can compare
binary data with an object, or two objects with each other.
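A hedged sketch of that idea in plain Java (hypothetical names throughout; `java.util.function.Function` stands in for the access functor): each Source supplies its own extractor for an int field, and the comparison meets the two sides only at the extracted value.

```java
import java.util.function.Function;

// Hypothetical sketch: generic equality over two Sources. Each side supplies
// its own extractor for the integer field; neither knows the other's context.
class SourceEquals {
    static <A, B> boolean intFieldEquals(Function<A, Integer> left, A a,
                                         Function<B, Integer> right, B b) {
        return left.apply(a).equals(right.apply(b));
    }
}
```

Here an `Object[]` plays the role of a record and an `int[]` plays the role of decoded binary; the equality code itself is indifferent to both.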
Data flow could also tee: one source with many targets.
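A tee might be sketched like this (hypothetical types, plain Java): the source value is read once and handed to each target in turn.

```java
import java.util.List;

// Hypothetical sketch of a tee: one source value fanned out to many targets.
class Tee {
    interface IntTarget { void put(int value); }

    static void tee(int sourceValue, List<IntTarget> targets) {
        for (IntTarget t : targets) {
            t.put(sourceValue);   // each target receives the same source value
        }
    }
}
```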
h4. Functional units
After much prototyping and design, I have identified that all Avro data flow
can be done by the composition of two functors:
The Unary Functor, which I have named *Access*:
{code}
interface Access<A, B> {
  B access(A a);
}
{code}
And a Binary Functor with two types named *Flow*:
{code}
interface Flow<A, B> {
  B flow(A a, B b);
}
{code}
In most cases, you can replace "A" with "FROM" and "B" with "TO" in relation to
the Source and Target concepts. These functions compose naturally in all the
ways required for data to flow from a source to a target.
h4. Making Symmetry
Consider this simple example, a Flow over the schema:
{code}
{"type": "record", "name": "Foo", "fields":
  [{"name": "x", "type": "int"}]
}
{code}
In the current implementation, a GenericDatumReader has the following API:
{code}
D read(D reuse, Decoder in);
{code}
which internally parses a Schema step by step, recursively calling methods with
a similar signature.
When we get to the leaf field, we return an integer, and on return insert that
into a GenericData.Record as the first field.
A very similar process occurs with GenericDatumWriter:
{code}
void write(D datum, Encoder out);
{code}
Which traverses a schema, recursively calling methods with a similar signature.
On the way down the schema graph, we access objects and pass portions of the
data through, and when we hit the leaf field, we write it to the encoder and
return.
Consider the innermost operation for both of the above:
Fetch an integer, then put it somewhere:
|| step || Source || Target || Source op || Target op || flow signature ||
| read an integer | IndexedRecord | Encoder | IndexedRecord.get() | (null) | int access(IndexedRecord) |
| read an integer | Decoder | IndexedRecord | Decoder.readInt() | (null) | int access(Decoder) |
| send integer to output | IndexedRecord | Encoder | (null) | Encoder.writeInt() | Encoder flow(int, Encoder) |
| send integer to output | Decoder | IndexedRecord | (null) | IndexedRecord.put() | IndexedRecord flow(int, IndexedRecord) |
The access and flow signatures compose as follows:
{code}
int access(A);
FollowedBy
B flow(int, B);
Equals:
B flow(A, B);
{code}
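This composition rule can be sketched directly in Java (hypothetical helper name, Java 8 lambdas assumed): an Access<A,Integer> followed by a Flow<Integer,B> yields a Flow<A,B>.

```java
// The two functors from above, plus a hypothetical composition helper.
interface Access<A, B> { B access(A a); }
interface Flow<A, B> { B flow(A a, B b); }

class Compose {
    static <A, B> Flow<A, B> intStep(Access<A, Integer> get, Flow<Integer, B> put) {
        // flow(a, b): extract the int from the source context, hand it to the target.
        return (a, b) -> put.flow(get.access(a), b);
    }
}
```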
So the above two examples compose to:
|| step || Source || Target || Source op || Target op || flow signature ||
| int flow | IndexedRecord | Encoder | IndexedRecord.get() | Encoder.writeInt() | Encoder flow(IndexedRecord, Encoder) |
| int flow | Decoder | IndexedRecord | Decoder.readInt() | IndexedRecord.put() | IndexedRecord flow(Decoder, IndexedRecord) |
As can be seen, for an integer field one can compose two functions, one
provided by the Source and one provided by the Target, and produce a Flow of
data between them.
The source and target each have their own contexts -- the object types that an
integer field represents -- but do not have to know anything about the other
side. The flow composition also does not need any information about the source
or target -- they meet only at "int".
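For the Foo record above, both directions can then be built from the same pieces. A hedged sketch with stand-ins (an Object[] for IndexedRecord, an int queue for the Decoder, a StringBuilder for the Encoder; all names hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: the write flow and the read flow for Foo's int field,
// each composed from a Source-supplied access and a Target-supplied flow.
class FooFlows {
    // Write direction: IndexedRecord source (Object[]) -> Encoder target (StringBuilder).
    static StringBuilder writeInt(Object[] record, StringBuilder encoder) {
        int value = (Integer) record[0];       // source op: IndexedRecord.get()
        return encoder.append(value);          // target op: Encoder.writeInt()
    }

    // Read direction: Decoder source (Deque<Integer>) -> IndexedRecord target (Object[]).
    static Object[] readInt(Deque<Integer> decoder, Object[] record) {
        int value = decoder.removeFirst();     // source op: Decoder.readInt()
        record[0] = value;                     // target op: IndexedRecord.put()
        return record;
    }
}
```

Note the symmetry: the two methods have mirrored flow signatures, and swapping endpoints (say, a StringBuilder "JsonEncoder" target on the read side) would change only the target op.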
> Java: Data Flow Overhaul -- Composition and Symmetry
> ----------------------------------------------------
>
> Key: AVRO-859
> URL: https://issues.apache.org/jira/browse/AVRO-859
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Scott Carey
> Assignee: Scott Carey
>
> Data flow in Avro is currently broken into two parts: Read and Write. These
> share many common patterns but almost no common code.
> Additionally, the APIs for this are DatumReader and DatumWriter, which
> requires that implementations know how to traverse Schemas and use the
> Resolver.
> This is a proposal to overhaul the inner workings of Avro Java between the
> Decoder/Encoder APIs and DatumReader/DatumWriter such that there is
> significantly more code re-use and much greater opportunity for new features
> that can all share in general optimizations and dynamic code generation.
> The two primary concepts involved are:
> * _*Functional Composition*_
> * _*Symmetry*_
> h4. Functional Composition
> All read and write operations can be broken into functional bits and composed
> rather than writing monolithic classes. This allows a "DatumWriter2" to be a
> graph of functions that pre-compute all state required from a schema rather
> than traverse a schema for each write.
> h4. Symmetry
> Avro's data flow can be made symmetric. Rather than thinking in terms of
> Read and Write, think in terms of:
> * _*Source*_: Where data that is represented by an Avro schema comes from --
> this may be a Decoder, or an Object graph.
> * _*Target*_: Where data that represents an Avro schema is sent -- this may
> be an Encoder or an Object graph.
> (More detail in the comments)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira