Thanks, Hamid and Vino Yang, for the great discussion.

On Fri, Feb 14, 2020 at 5:18 AM vino yang <[email protected]> wrote:

> I have filed a Jira issue[1] to track this work.
>
> [1]: https://issues.apache.org/jira/browse/HUDI-613
>
> On Thu, Feb 13, 2020 at 9:51 PM vino yang <[email protected]> wrote:
>
> > Hi hamid,
> >
> > Agree with your opinion.
> >
> > Let's move forward step by step.
> >
> > I will file an issue to track the Transformer refactoring.
> >
> > Best,
> > Vino
> >
> > On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh <[email protected]> wrote:
> >
> >> I think it is a good idea to decouple the Transformer from Spark so that
> >> it can be used with other flow engines.
> >> Once you do that, it is worth considering a much bigger play rather
> >> than another incremental one.
> >> Given the scale of Hudi, we need to look at Airflow, particularly in the
> >> context of what Google is doing with Composer, addressing autoscaling,
> >> scheduling, monitoring, etc.
> >> You need all of that to manage a serious ETL/ELT flow.
> >>
> >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> >>
> >> > Currently, Hudi has a component that has not been widely used: Transformer.
> >> > As we all know, before raw data lands in the data lake, a very
> >> > common operation is data preprocessing and ETL. This is also the most
> >> > common usage scenario for many computing engines, such as Flink and Spark.
> >> > Now that Hudi builds on the power of the computing engine, it can
> >> > also naturally take advantage of the engine's data preprocessing abilities.
> >> > We can refactor the Transformer to make it more flexible. To summarize, we
> >> > can refactor along the following aspects:
> >> >
> >> >    - Decouple Transformer from Spark
> >> >    - Enrich the Transformer and provide built-in Transformers
> >> >    - Support Transformer-chain
> >> >
> >> > For the first point, the Transformer interface is tightly coupled with
> >> > Spark in design, and it contains a Spark-specific context. This makes it
> >> > impossible for us to take advantage of the transform capabilities provided
> >> > by other engines (such as Flink) after supporting multiple engines.
> >> > Therefore, we need to decouple it from Spark in design.
> >> >
> >> > For the second point, we can enhance the Transformer and provide some
> >> > out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer,
> >> > and so on.
> >> >
> >> > For the third point, the most common pattern for data processing is the
> >> > pipeline model, and a common implementation of the pipeline model is the
> >> > chain-of-responsibility pattern, which can be compared to Apache Commons
> >> > Chain[1]. Combining multiple Transformers makes data processing
> >> > more flexible and extensible.
> >> >
> >> > If we enhance the capabilities of the Transformer component, Hudi will provide
> >> > richer data processing capabilities based on the computing engine.
> >> >
> >> > What do you think?
> >> >
> >> > Any opinions and feedback are welcome and appreciated.
> >> >
> >> > Best,
> >> > Vino
> >> >
> >> > [1]: https://commons.apache.org/proper/commons-chain/
> >> >
> >>
> >
>
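
To make the three points in the quoted proposal concrete (decoupling from Spark, built-in Transformers, and a Transformer chain), here is a minimal sketch in Java. Everything in it is an assumption for illustration and not Hudi's actual API: the generic Transformer interface, the andThen method, the constructor signatures, and the TransformerChainDemo class are all hypothetical, and plain java.util.List stands in for whatever dataset abstraction an engine such as Spark or Flink would supply.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical engine-agnostic transformer: no Spark or Flink type appears
// in the signature. List is only a stand-in for an engine's dataset type.
interface Transformer<I, O> {
    List<O> apply(List<I> input);

    // Chain this transformer with the next one, pipeline style.
    default <R> Transformer<I, R> andThen(Transformer<O, R> next) {
        return input -> next.apply(this.apply(input));
    }
}

// Hypothetical built-in: keeps only the records that match a predicate.
final class FilterTransformer<T> implements Transformer<T, T> {
    private final Predicate<T> predicate;

    FilterTransformer(Predicate<T> predicate) {
        this.predicate = predicate;
    }

    @Override
    public List<T> apply(List<T> input) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }
}

// Hypothetical built-in: expands each record into zero or more records.
final class FlatMapTransformer<I, O> implements Transformer<I, O> {
    private final Function<I, List<O>> fn;

    FlatMapTransformer(Function<I, List<O>> fn) {
        this.fn = fn;
    }

    @Override
    public List<O> apply(List<I> input) {
        return input.stream()
                .flatMap(record -> fn.apply(record).stream())
                .collect(Collectors.toList());
    }
}

final class TransformerChainDemo {
    // Builds a two-stage chain: drop empty lines, then split on commas.
    static List<String> run(List<String> lines) {
        Transformer<String, String> chain =
                new FilterTransformer<String>(s -> !s.isEmpty())
                        .andThen(new FlatMapTransformer<String, String>(
                                s -> Arrays.asList(s.split(","))));
        return chain.apply(lines);
    }

    public static void main(String[] args) {
        // prints [a, b, c]
        System.out.println(run(Arrays.asList("a,b", "", "c")));
    }
}
```

Composing through a default method keeps each Transformer unaware of its neighbours, which is one simple way to get the pipeline behaviour described above. A real design would still need an engine-neutral way to pass configuration and context, which this sketch leaves out.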
