I think it is a good idea to decouple the Transformer from Spark so that it can be used with other flow engines. Once you do that, it is worth considering a much bigger play rather than another incremental one. Given the scale of Hudi, we need to look at Airflow, particularly in the context of what Google is doing with Composer: addressing autoscaling, scheduling, monitoring, etc. You need all of that to manage a serious ETL/ELT flow.
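
To make the decoupling and chaining idea concrete, here is a rough sketch of what an engine-neutral contract could look like. This is not Hudi's actual Transformer API: the `Record` type, the `apply` method, and the `TransformerChain` class are hypothetical names I made up for illustration, and only `FilterTransformer` and `FlatMapTransformer` are borrowed from the proposal. Python is used purely for brevity; the same shape would presumably carry over to Java.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical engine-neutral record type (assumption, not a Hudi type).
Record = Dict[str, Any]


class Transformer(ABC):
    """Hypothetical contract: records in, records out, no engine context."""

    @abstractmethod
    def apply(self, records: Iterable[Record]) -> Iterable[Record]:
        ...


class FilterTransformer(Transformer):
    """Keeps only the records for which the predicate returns True."""

    def __init__(self, predicate: Callable[[Record], bool]) -> None:
        self.predicate = predicate

    def apply(self, records: Iterable[Record]) -> Iterable[Record]:
        return (r for r in records if self.predicate(r))


class FlatMapTransformer(Transformer):
    """Expands each record into zero or more output records."""

    def __init__(self, fn: Callable[[Record], Iterable[Record]]) -> None:
        self.fn = fn

    def apply(self, records: Iterable[Record]) -> Iterable[Record]:
        return (out for r in records for out in self.fn(r))


class TransformerChain(Transformer):
    """Runs the stages in order; the output of one stage feeds the next."""

    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = list(stages)

    def apply(self, records: Iterable[Record]) -> Iterable[Record]:
        for stage in self.stages:
            records = stage.apply(records)
        return records


# Usage: drop non-positive amounts, then emit one record per tag.
chain = TransformerChain([
    FilterTransformer(lambda r: r["amount"] > 0),
    FlatMapTransformer(lambda r: [{"id": r["id"], "tag": t} for t in r["tags"]]),
])
rows = [
    {"id": 1, "amount": 5, "tags": ["a", "b"]},
    {"id": 2, "amount": 0, "tags": ["c"]},
]
result = list(chain.apply(rows))
```

Because a chain is itself a Transformer, chains can be nested, and an engine-specific adapter (Spark, Flink, or something scheduled from Airflow) would only need to map `apply` onto its own dataset type. That adapter layer is an assumption on my part and is not shown here.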
On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> Currently, Hudi has a component that has not been widely used: Transformer.
> As we all know, before the original data fell into the data lake, a very
> common operation is data preprocessing and ETL. This is also the most
> common use scenario of many computing engines, such as Flink and Spark. Now
> that Hudi has taken advantage of the power of the computing engine, it can
> also naturally take advantage of its ability of data preprocessing. We can
> refactor the Transformer to make it become more flexible. To summarize, we
> can refactor from the following aspects:
>
> - Decouple Transformer from Spark
> - Enrich the Transformer and provide built-in transformer
> - Support Transformer-chain
>
> For the first point, the Transformer interface is tightly coupled with
> Spark in design, and it contains a Spark-specific context. This makes it
> impossible for us to take advantage of the transform capabilities provided
> by other engines (such as Flink) after supporting multiple engines.
> Therefore, we need to decouple it from Spark in design.
>
> For the second point, we can enhance the Transformer and provide some
> out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer,
> and so on.
>
> For the third point, the most common pattern for data processing is the
> pipeline model, and the common implementation of the pipeline model is the
> responsibility chain model, which can be compared to the Apache commons
> chain[1], combining multiple Transformers can make data-processing become
> more flexible and expandable.
>
> If we enhance the capabilities of Transformer components, Hudi will provide
> richer data processing capabilities based on the computing engine.
>
> What do you think?
>
> Any opinions and feedback are welcome and appreciated.
>
> Best,
> Vino
>
> [1]: https://commons.apache.org/proper/commons-chain/
>
