Re: Schema Discovery Support in Apex Applications

Chinmay Kolhatkar Wed, 25 Jan 2017 01:48:32 -0800

Thank you all for the feedback.

I've created a Jira for this, APEXCORE-623, and I'll attach the same
document and a link to this mail thread there.


As the first part of this Jira, I would like to propose two steps:
1. Add the following interface as com.datatorrent.common.util.SchemaAware.

interface SchemaAware {
  // Given the schema on each input port, return the schema of each output port.
  Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema);
}

Operators can implement this interface to communicate their output
schema(s) to the engine.
The input to this method will be the schema at each of the operator's
input ports.

2. After the LogicalPlan is created, call the SchemaAware method on each
operator, from upstream to downstream in the DAG, to propagate the schema.
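
To make step 2 more concrete, here is a rough Java sketch of the
propagation pass. It is only an illustration under assumptions, not engine
code: InputPort, OutputPort and Schema are hypothetical stand-ins, the
operators are assumed to be already sorted upstream to downstream, and each
output port is assumed to feed a single input port (real streams can have
several sinks).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the engine's port and schema types.
final class InputPort {
  final String name;
  InputPort(String name) { this.name = name; }
}

final class OutputPort {
  final String name;
  OutputPort(String name) { this.name = name; }
}

final class Schema {
  final Map<String, String> fields = new LinkedHashMap<>(); // field name -> type name
  Schema() {}
  Schema(Schema base) { fields.putAll(base.fields); }
  Schema field(String name, String type) { fields.put(name, type); return this; }
}

// The interface as proposed in this mail.
interface SchemaAware {
  Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema);
}

final class SchemaPropagationSketch {
  // Visits operators upstream to downstream and returns the schema
  // resolved for every connected input port.
  static Map<InputPort, Schema> propagate(
      List<SchemaAware> upstreamFirst,
      Map<SchemaAware, List<InputPort>> inputsOf,
      Map<OutputPort, InputPort> streams) {
    Map<InputPort, Schema> resolved = new HashMap<>();
    for (SchemaAware op : upstreamFirst) {
      // Collect the schemas already known for this operator's input ports.
      Map<InputPort, Schema> in = new LinkedHashMap<>();
      for (InputPort port : inputsOf.getOrDefault(op, Collections.<InputPort>emptyList())) {
        if (resolved.containsKey(port)) {
          in.put(port, resolved.get(port));
        }
      }
      // Ask the operator for its output schemas and push them downstream.
      for (Map.Entry<OutputPort, Schema> e : op.registerSchema(in).entrySet()) {
        InputPort sink = streams.get(e.getKey());
        if (sink != null) {
          resolved.put(sink, e.getValue());
        }
      }
    }
    return resolved;
  }

  // Parser -> Transform -> Formatter; returns the schema reaching the formatter.
  static Schema demo() {
    OutputPort parserOut = new OutputPort("parser.out");
    InputPort transformIn = new InputPort("transform.in");
    OutputPort transformOut = new OutputPort("transform.out");
    InputPort formatterIn = new InputPort("formatter.in");

    SchemaAware parser = in ->
        Collections.singletonMap(parserOut, new Schema().field("id", "long"));
    SchemaAware transform = in ->
        Collections.singletonMap(transformOut,
            new Schema(in.get(transformIn)).field("total", "double"));
    SchemaAware formatter = in -> Collections.<OutputPort, Schema>emptyMap();

    Map<SchemaAware, List<InputPort>> inputsOf = new HashMap<>();
    inputsOf.put(transform, Collections.singletonList(transformIn));
    inputsOf.put(formatter, Collections.singletonList(formatterIn));

    Map<OutputPort, InputPort> streams = new HashMap<>();
    streams.put(parserOut, transformIn);
    streams.put(transformOut, formatterIn);

    return propagate(Arrays.asList(parser, transform, formatter), inputsOf, streams)
        .get(formatterIn);
  }
}
```

In this sketch the formatter ends up with the parser's field plus the one
added by the transform, without anything being configured on the ports by
hand.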

Once this is done, the operators in question can be changed in Malhar.
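
As an illustration of what such an operator change might look like, here is
a minimal sketch of a projection-style operator that derives its output
schema from its input schema and a processing property. Everything in it is
hypothetical: the nested InputPort, OutputPort and Schema types are
stand-ins, and Projection is not an existing Malhar operator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProjectionOperatorSketch {
  // Hypothetical stand-ins, nested so the sketch compiles on its own.
  static final class InputPort {}
  static final class OutputPort {}
  static final class Schema {
    final Map<String, String> fields = new LinkedHashMap<>(); // field name -> type name
  }
  interface SchemaAware {
    Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema);
  }

  // An operator that keeps only the configured fields of each tuple.
  static final class Projection implements SchemaAware {
    final InputPort input = new InputPort();
    final OutputPort output = new OutputPort();
    private final List<String> keep; // processing property set by the user

    Projection(List<String> keep) { this.keep = keep; }

    @Override
    public Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema) {
      Schema in = inputSchema.get(input);
      if (in == null) {
        throw new IllegalStateException("no schema available on the input port");
      }
      // The output schema is the input schema restricted to the kept fields.
      Schema out = new Schema();
      for (String name : keep) {
        String type = in.fields.get(name);
        if (type == null) {
          throw new IllegalArgumentException("unknown field: " + name);
        }
        out.fields.put(name, type);
      }
      return Collections.singletonMap(output, out);
    }
  }

  // Feeds a three-field schema into a projection that keeps two fields.
  static Schema demo() {
    Schema in = new Schema();
    in.fields.put("id", "long");
    in.fields.put("name", "String");
    in.fields.put("amount", "double");
    Projection op = new Projection(Arrays.asList("id", "amount"));
    return op.registerSchema(Collections.singletonMap(op.input, in)).get(op.output);
  }
}
```

Failing fast on an unknown field is a design choice of this sketch; it would
surface configuration mistakes when the plan is validated rather than at
runtime.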

Please share your opinion on this approach.

Thanks,
Chinmay.




On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale <pri...@apache.org> wrote:

> +1 to have this feature.
>
> -Priyanka
>
> On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni <pra...@datatorrent.com>
> wrote:
>
> > +1
> >
> > On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar <chin...@apache.org>
> > wrote:
> >
> > > Hi All,
> > >
> > > Currently, if a user-created DAG contains any POJOfied operators, the
> > > TUPLE_CLASS attribute needs to be set on each and every port that
> > > receives or sends a POJO.
> > >
> > > For example, if the DAG is File -> Parser -> Transform -> Dedup ->
> > > Formatter -> Kafka, then the user needs to set the TUPLE_CLASS attribute
> > > on both the input and output ports of the transform and dedup operators,
> > > and also on the parser output and the formatter input.
> > >
> > > The proposal here is to reduce the work required from the user to
> > > configure the DAG. Technically speaking, if an operator knows its input
> > > schema and processing properties, it can determine its output schema and
> > > convey it to downstream operators. This way the complete pipeline can be
> > > configured without the user setting TUPLE_CLASS or even creating POJOs
> > > and adding them to the classpath.
> > >
> > > Building on this idea, I want to propose an approach where the pipeline
> > > can be configured without the user setting TUPLE_CLASS or even creating
> > > POJOs and adding them to the classpath.
> > > Here is the document that explains the idea and a high-level design:
> > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
> > >
> > > I would like to get the community's opinion on the feasibility and
> > > applications of this proposal.
> > > Once we reach some consensus, we can discuss the design in detail.
> > >
> > > Thanks,
> > > Chinmay.
> > >
> >
>
