Hi All,

Currently, if a user-generated DAG contains any POJOfied operators, the
TUPLE_CLASS attribute needs to be set on every port that receives or
sends a POJO.

For example, if the DAG is File -> Parser -> Transform -> Dedup ->
Formatter -> Kafka, then the user needs to set the TUPLE_CLASS attribute on
both the input and output ports of the Transform and Dedup operators, and
also on the Parser output and the Formatter input.

The proposal here is to reduce the work required from the user to configure
the DAG. Technically speaking, if an operator knows its input schema and
processing properties, it can determine its output schema and convey it to
the downstream operators. This way the complete pipeline can be configured
without the user setting TUPLE_CLASS or even creating POJOs and adding them
to the classpath.
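
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use any Apex API; SchemaAwareOperator, AddFields, PassThrough and
propagate are hypothetical names invented for this illustration, and a
schema is assumed to be just an ordered map of field name to type. It only
shows how each operator could derive its output schema from its input
schema and its own properties, so that only the first schema in the
pipeline has to be known up front.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaPropagation {

    /** Hypothetical schema representation: ordered field name -> field type. */
    static Map<String, Class<?>> schemaOf(Object... nameTypePairs) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        for (int i = 0; i < nameTypePairs.length; i += 2) {
            schema.put((String) nameTypePairs[i], (Class<?>) nameTypePairs[i + 1]);
        }
        return schema;
    }

    /** Hypothetical contract: an operator derives its output schema from its input schema. */
    interface SchemaAwareOperator {
        Map<String, Class<?>> outputSchema(Map<String, Class<?>> inputSchema);
    }

    /** Dedup-like operator: tuples pass through unchanged, so the schema does too. */
    static class PassThrough implements SchemaAwareOperator {
        @Override
        public Map<String, Class<?>> outputSchema(Map<String, Class<?>> in) {
            return new LinkedHashMap<>(in);
        }
    }

    /** Transform-like operator: its properties (derived fields) extend the input schema. */
    static class AddFields implements SchemaAwareOperator {
        private final Map<String, Class<?>> derived;

        AddFields(Map<String, Class<?>> derived) {
            this.derived = derived;
        }

        @Override
        public Map<String, Class<?>> outputSchema(Map<String, Class<?>> in) {
            Map<String, Class<?>> out = new LinkedHashMap<>(in);
            out.putAll(derived);
            return out;
        }
    }

    /** Walks a linear pipeline and returns the schema at every stage boundary. */
    static List<Map<String, Class<?>>> propagate(Map<String, Class<?>> source,
                                                 List<SchemaAwareOperator> ops) {
        List<Map<String, Class<?>>> schemas = new ArrayList<>();
        Map<String, Class<?>> current = source;
        schemas.add(current);
        for (SchemaAwareOperator op : ops) {
            current = op.outputSchema(current);
            schemas.add(current);
        }
        return schemas;
    }

    public static void main(String[] args) {
        // Only the parser's output schema is supplied; the rest is derived.
        Map<String, Class<?>> parserOutput = schemaOf("id", Long.class, "amount", Double.class);
        List<SchemaAwareOperator> ops = new ArrayList<>();
        ops.add(new AddFields(schemaOf("tax", Double.class))); // Transform
        ops.add(new PassThrough());                            // Dedup
        for (Map<String, Class<?>> schema : propagate(parserOutput, ops)) {
            System.out.println(schema.keySet());
        }
    }
}
```

In this sketch the user describes only the parser output and the transform
properties; the schemas seen by Dedup and Formatter follow from those.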

Based on this idea, I want to propose an approach for configuring the
pipeline in this way. Here is the document that explains the idea and a
high-level design:
https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing

I would like to get the community's opinion on the feasibility and
applications of this proposal.
Once we reach some consensus, we can discuss the design in detail.

Thanks,
Chinmay.
