Re: Refactor and enhance Hudi Transformer

vino yang Fri, 07 Feb 2020 18:11:55 -0800

Hi hamid,

AFAIK, Transformer currently runs as a task (a Spark task) in the context of
Hudi reading data.
It is not a standalone component outside of Hudi.
Can you describe in more detail how you would use Apache Airflow here?
I personally suggest that we agree on a premise: our goal is to enhance
the data preprocessing capabilities of Hudi.


Best,
Vino


hamid pirahesh <hpirah...@gmail.com> wrote on Sat, Feb 8, 2020 at 4:23 AM:

> What about using Apache Airflow for creating a DAG of
> transformer operators?
>
> On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
>
> > Currently, Hudi has a component that has not been widely used: Transformer.
> > As we all know, before raw data lands in the data lake, a very
> > common operation is data preprocessing and ETL. This is also the most
> > common use case for many computing engines, such as Flink and Spark.
> > Now that Hudi takes advantage of the power of the computing engine, it
> > can also naturally take advantage of the engine's data preprocessing
> > ability. We can refactor the Transformer to make it more flexible. To
> > summarize, we can refactor along the following lines:
> >
> >    - Decouple Transformer from Spark
> >    - Enrich the Transformer and provide built-in transformers
> >    - Support Transformer-chain
> >
> > For the first point, the Transformer interface is tightly coupled with
> > Spark by design, and it contains a Spark-specific context. This makes it
> > impossible for us to take advantage of the transform capabilities provided
> > by other engines (such as Flink) once multiple engines are supported.
> > Therefore, we need to decouple it from Spark in design.
> >
> > For the second point, we can enhance the Transformer and provide some
> > out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer,
> > and so on.
> >
> > For the third point, the most common pattern for data processing is the
> > pipeline model, and a common implementation of the pipeline model is the
> > chain-of-responsibility pattern, which can be compared to Apache Commons
> > Chain[1]. Combining multiple Transformers can make data processing
> > more flexible and extensible.
> >
> > If we enhance the capabilities of the Transformer component, Hudi will
> > provide richer data processing capabilities based on the computing engine.
> >
> > What do you think?
> >
> > Any opinions and feedback are welcome and appreciated.
> >
> > Best,
> > Vino
> >
> > [1]: https://commons.apache.org/proper/commons-chain/
> >
>
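
For illustration, the three points in the quoted proposal (an engine-agnostic
interface, built-in transformers, and a transformer chain) could be sketched
roughly as below. This is only a minimal sketch under stated assumptions, not
Hudi's actual API: the Transformer interface, FilterTransformer,
FlatMapTransformer, andThen, and splitNonEmptyLines are all hypothetical names,
and plain Java lists stand in for whatever dataset abstraction an engine such
as Spark or Flink would supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types are Hudi's real API.
public class TransformerChainSketch {

    // Engine-agnostic contract: no Spark-specific context in the signature.
    interface Transformer<I, O> {
        List<O> apply(List<I> input);

        // Chain this transformer with the next one (pipeline style).
        default <R> Transformer<I, R> andThen(Transformer<O, R> next) {
            return input -> next.apply(apply(input));
        }
    }

    // Built-in transformer that keeps only records matching a predicate.
    static final class FilterTransformer<T> implements Transformer<T, T> {
        private final Predicate<T> predicate;

        FilterTransformer(Predicate<T> predicate) {
            this.predicate = predicate;
        }

        @Override
        public List<T> apply(List<T> input) {
            List<T> out = new ArrayList<>();
            for (T record : input) {
                if (predicate.test(record)) {
                    out.add(record);
                }
            }
            return out;
        }
    }

    // Built-in transformer that expands each record into zero or more records.
    static final class FlatMapTransformer<I, O> implements Transformer<I, O> {
        private final Function<I, List<O>> fn;

        FlatMapTransformer(Function<I, List<O>> fn) {
            this.fn = fn;
        }

        @Override
        public List<O> apply(List<I> input) {
            List<O> out = new ArrayList<>();
            for (I record : input) {
                out.addAll(fn.apply(record));
            }
            return out;
        }
    }

    // Example chain: drop empty lines, then split each line on commas.
    static Transformer<String, String> splitNonEmptyLines() {
        return new FilterTransformer<String>(line -> !line.isEmpty())
            .andThen(new FlatMapTransformer<String, String>(
                line -> Arrays.asList(line.split(","))));
    }

    public static void main(String[] args) {
        List<String> result = splitNonEmptyLines().apply(Arrays.asList("a,b", "", "c"));
        System.out.println(result); // prints [a, b, c]
    }
}
```

Because the chain is itself a Transformer, callers would not need to know
whether they hold a single step or a whole pipeline, which is the property the
chain idea in the proposal relies on.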
