What about using Apache Airflow for creating a DAG of transformer operators?
On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:

> Currently, Hudi has a component that has not been widely used: the
> Transformer. As we all know, before raw data lands in the data lake, a
> very common operation is data preprocessing and ETL. This is also one of
> the most common use cases for many computing engines, such as Flink and
> Spark. Since Hudi already takes advantage of the power of the computing
> engine, it can also naturally take advantage of its data-preprocessing
> capabilities. We can refactor the Transformer to make it more flexible.
> To summarize, we can refactor along the following lines:
>
> - Decouple the Transformer from Spark
> - Enrich the Transformer and provide built-in transformers
> - Support Transformer chains
>
> On the first point, the Transformer interface is tightly coupled with
> Spark by design, and it contains a Spark-specific context. This makes it
> impossible for us to take advantage of the transform capabilities
> provided by other engines (such as Flink) once multiple engines are
> supported. Therefore, we need to decouple it from Spark in the design.
>
> On the second point, we can enhance the Transformer and provide some
> out-of-the-box transformers, such as FilterTransformer,
> FlatMapTransformer, and so on.
>
> On the third point, the most common pattern for data processing is the
> pipeline model, and a common implementation of the pipeline model is the
> chain-of-responsibility pattern, which can be compared to Apache Commons
> Chain [1]. Combining multiple Transformers would make data processing
> more flexible and extensible.
>
> If we enhance the capabilities of the Transformer component, Hudi will
> provide richer data-processing capabilities on top of the computing
> engine.
>
> What do you think?
>
> Any opinions and feedback are welcome and appreciated.
>
> Best,
> Vino
>
> [1]: https://commons.apache.org/proper/commons-chain/
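To make the proposal concrete, the three points above could be sketched roughly as follows. This is a minimal illustration, not Hudi's actual API: the `Transformer<T>` interface, `FilterTransformer`, and `TransformerChain` names and signatures here are hypothetical, chosen only to show an engine-agnostic contract (no Spark context in the signature), one built-in transformer, and a chain-of-responsibility combinator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical engine-agnostic transformer: no SparkSession or Spark-specific
// context in the signature, so a Flink-backed implementation could plug in
// the same way. Illustrative only, not Hudi's actual interface.
interface Transformer<T> {
    T apply(T dataset);
}

// Example of a built-in transformer: keeps only elements matching a predicate.
class FilterTransformer<E> implements Transformer<List<E>> {
    private final Predicate<E> predicate;

    FilterTransformer(Predicate<E> predicate) {
        this.predicate = predicate;
    }

    @Override
    public List<E> apply(List<E> dataset) {
        List<E> result = new ArrayList<>();
        for (E element : dataset) {
            if (predicate.test(element)) {
                result.add(element);
            }
        }
        return result;
    }
}

// Chain-of-responsibility combinator: runs each stage in order, feeding the
// output of one transformer into the next, in the spirit of Commons Chain.
class TransformerChain<T> implements Transformer<T> {
    private final List<Transformer<T>> stages = new ArrayList<>();

    TransformerChain<T> add(Transformer<T> stage) {
        stages.add(stage);
        return this;
    }

    @Override
    public T apply(T dataset) {
        T current = dataset;
        for (Transformer<T> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

A pipeline would then be assembled as `new TransformerChain<List<Integer>>().add(new FilterTransformer<Integer>(x -> x % 2 == 0)).add(...)` and executed with a single `apply` call; because `Transformer` is a functional interface, ad-hoc stages can also be supplied as lambdas.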