Hi Hamid,

I agree with your opinion.
Let's move forward step by step. I will file an issue to track the
Transformer refactoring.

Best,
Vino

hamid pirahesh <[email protected]> wrote on Thu, Feb 13, 2020 at 6:38 PM:

> I think it is a good idea to decouple the transformer from Spark so that
> it can be used with other flow engines.
> Once you do that, then it is worth considering a much bigger play rather
> than another incremental play.
> Given the scale of Hudi, we need to look at Airflow, particularly in the
> context of what Google is doing with Composer, addressing autoscaling,
> scheduling, monitoring, etc.
> You need all of that to manage a serious ETL/ELT flow.
>
> On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
>
> > Currently, Hudi has a component that has not been widely used:
> > Transformer. As we all know, before raw data lands in the data lake, a
> > very common operation is data preprocessing and ETL. This is also the
> > most common use case for many computing engines, such as Flink and
> > Spark. Now that Hudi takes advantage of the power of the computing
> > engine, it can also naturally take advantage of the engine's data
> > preprocessing capabilities. We can refactor the Transformer to make it
> > more flexible. To summarize, we can refactor along the following lines:
> >
> > - Decouple Transformer from Spark
> > - Enrich the Transformer and provide built-in transformers
> > - Support Transformer chains
> >
> > For the first point, the Transformer interface is tightly coupled with
> > Spark by design, and it contains a Spark-specific context. This makes it
> > impossible for us to take advantage of the transform capabilities
> > provided by other engines (such as Flink) once multiple engines are
> > supported. Therefore, we need to decouple it from Spark in its design.
> >
> > For the second point, we can enhance the Transformer and provide some
> > out-of-the-box Transformers, such as FilterTransformer,
> > FlatMapTransformer, and so on.
> >
> > For the third point, the most common pattern for data processing is the
> > pipeline model, and a common implementation of the pipeline model is
> > the chain-of-responsibility pattern, which can be compared to Apache
> > Commons Chain [1]. Combining multiple Transformers can make data
> > processing more flexible and extensible.
> >
> > If we enhance the capabilities of the Transformer component, Hudi will
> > provide richer data processing capabilities based on the computing
> > engine.
> >
> > What do you think?
> >
> > Any opinions and feedback are welcome and appreciated.
> >
> > Best,
> > Vino
> >
> > [1]: https://commons.apache.org/proper/commons-chain/
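
To make the three points in the quoted proposal a bit more concrete, here
is a rough sketch of what an engine-neutral Transformer, two built-in
transformers, and a chain could look like. This is only an illustration
under my own assumptions, not Hudi's actual API: the Transformer<D>
signature, the ChainedTransformer class, and the demo method are all
hypothetical, and plain java.util.List stands in for whatever dataset type
a given engine (Spark, Flink, ...) would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class TransformerSketch {

    /** Hypothetical engine-neutral contract: D is the engine's dataset type. */
    interface Transformer<D> {
        D apply(D input);
    }

    /** Hypothetical built-in: keeps only the records that match a predicate. */
    static final class FilterTransformer<R> implements Transformer<List<R>> {
        private final Predicate<R> predicate;

        FilterTransformer(Predicate<R> predicate) {
            this.predicate = predicate;
        }

        @Override
        public List<R> apply(List<R> input) {
            return input.stream().filter(predicate).collect(Collectors.toList());
        }
    }

    /** Hypothetical built-in: expands each record into zero or more records. */
    static final class FlatMapTransformer<R> implements Transformer<List<R>> {
        private final Function<R, List<R>> fn;

        FlatMapTransformer(Function<R, List<R>> fn) {
            this.fn = fn;
        }

        @Override
        public List<R> apply(List<R> input) {
            return input.stream()
                    .flatMap(r -> fn.apply(r).stream())
                    .collect(Collectors.toList());
        }
    }

    /** Hypothetical chain: runs the stages in order, feeding each output forward. */
    static final class ChainedTransformer<D> implements Transformer<D> {
        private final List<Transformer<D>> stages;

        ChainedTransformer(List<Transformer<D>> stages) {
            this.stages = new ArrayList<>(stages);
        }

        @Override
        public D apply(D input) {
            D current = input;
            for (Transformer<D> stage : stages) {
                current = stage.apply(current);
            }
            return current;
        }
    }

    /** Splits comma-separated lines into fields, then drops the empty fields. */
    public static List<String> demo(List<String> lines) {
        Transformer<List<String>> split =
                new FlatMapTransformer<String>(l -> Arrays.asList(l.split(",")));
        Transformer<List<String>> nonEmpty =
                new FilterTransformer<String>(s -> !s.isEmpty());
        List<Transformer<List<String>>> stages = Arrays.asList(split, nonEmpty);
        return new ChainedTransformer<List<String>>(stages).apply(lines);
    }

    public static void main(String[] args) {
        System.out.println(demo(Arrays.asList("a,,b", "c"))); // prints [a, b, c]
    }
}
```

Because the chain itself implements Transformer, a whole pipeline can be
passed anywhere a single Transformer is expected. An engine-specific
binding would then only need to choose the dataset type D, which is the
part that would have to be worked out in the actual refactoring.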
