What about using Apache Airflow to create a DAG of transformer operators?

On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:

> Currently, Hudi has a component that has not been widely used: the
> Transformer. As we all know, before raw data lands in the data lake, a
> very common step is data preprocessing and ETL. This is also one of the
> most common use cases for computing engines such as Flink and Spark.
> Since Hudi already builds on the power of a computing engine, it can
> naturally leverage that engine's preprocessing abilities as well. We can
> refactor the Transformer to make it more flexible. To summarize, we can
> refactor along the following aspects:
>
>    - Decouple Transformer from Spark
>    - Enrich the Transformer and provide built-in transformers
>    - Support Transformer chains
>
> For the first point, the Transformer interface is tightly coupled to
> Spark by design: its signature carries a Spark-specific context. Once
> Hudi supports multiple engines, this makes it impossible to take
> advantage of the transform capabilities other engines (such as Flink)
> provide. Therefore, we need to decouple the interface from Spark.
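>
> For illustration, a rough sketch of what a decoupled interface could
> look like. EngineContext and the generic dataset type T below are
> hypothetical stand-ins for this discussion, not existing Hudi classes:
>
>   // Transformer.java
>   import java.util.Properties;
>
>   // Engine-agnostic transformer: no Spark types in the signature; T is
>   // whatever dataset type the underlying engine uses.
>   public interface Transformer<T> {
>     // 'context' wraps whichever engine runtime the pipeline runs on
>     // (e.g. a SparkSession or a Flink execution environment).
>     T apply(EngineContext context, T input, Properties props);
>   }
>
>   // EngineContext.java
>   // Minimal, hypothetical abstraction over the engine runtime.
>   public interface EngineContext {
>     String engineName();
>   }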
>
> For the second point, we can enhance the Transformer and provide some
> out-of-the-box Transformers, such as FilterTransformer,
> FlatMapTransformer, and so on.
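>
> As a sketch of one such built-in, building on the hypothetical
> interface above (a plain java.util.List stands in for the engine's
> dataset type):
>
>   // FilterTransformer.java
>   import java.util.List;
>   import java.util.Properties;
>   import java.util.function.Predicate;
>   import java.util.stream.Collectors;
>
>   // Built-in transformer that keeps only the records matching a
>   // user-supplied predicate.
>   public class FilterTransformer<R> implements Transformer<List<R>> {
>     private final Predicate<R> predicate;
>
>     public FilterTransformer(Predicate<R> predicate) {
>       this.predicate = predicate;
>     }
>
>     @Override
>     public List<R> apply(EngineContext context, List<R> input,
>                          Properties props) {
>       // Drop every record the predicate rejects.
>       return input.stream().filter(predicate).collect(Collectors.toList());
>     }
>   }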
>
> For the third point, the most common pattern for data processing is the
> pipeline model, and a common implementation of the pipeline is the
> chain-of-responsibility pattern, comparable to Apache Commons Chain[1].
> Combining multiple Transformers would make data processing more
> flexible and extensible.
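>
> A sketch of how chaining could look with the hypothetical interface
> above; each stage's output becomes the next stage's input, in the
> spirit of Commons Chain:
>
>   // ChainedTransformer.java
>   import java.util.List;
>   import java.util.Properties;
>
>   // Runs a fixed sequence of transformers in order.
>   public class ChainedTransformer<T> implements Transformer<T> {
>     private final List<Transformer<T>> chain;
>
>     public ChainedTransformer(List<Transformer<T>> chain) {
>       this.chain = chain;
>     }
>
>     @Override
>     public T apply(EngineContext context, T input, Properties props) {
>       // Thread the dataset through every stage of the chain.
>       T current = input;
>       for (Transformer<T> t : chain) {
>         current = t.apply(context, current, props);
>       }
>       return current;
>     }
>   }
>
> For example, a ChainedTransformer holding a FilterTransformer followed
> by a flat-map stage would filter first and feed the surviving records
> to the next stage.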
>
> If we enhance the Transformer component along these lines, Hudi will
> provide richer data-processing capabilities on top of the underlying
> computing engine.
>
> What do you think?
>
> Any opinions and feedback are welcome and appreciated.
>
> Best,
> Vino
>
> [1]: https://commons.apache.org/proper/commons-chain/
>
