Thanks Hamid and Vinoyang for the great discussion. To make the proposal
a bit more concrete, I have added a few rough sketches below the quoted
thread.

On Fri, Feb 14, 2020 at 5:18 AM vino yang <[email protected]> wrote:
> I have filed a Jira issue[1] to track this work.
>
> [1]: https://issues.apache.org/jira/browse/HUDI-613
>
> On Thu, Feb 13, 2020 at 9:51 PM vino yang <[email protected]> wrote:
>
> > Hi hamid,
> >
> > Agree with your opinion.
> >
> > Let's move forward step by step.
> >
> > Will file an issue to track the refactoring of the Transformer.
> >
> > Best,
> > Vino
> >
> > On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh <[email protected]> wrote:
> >
> >> I think it is a good idea to decouple the Transformer from Spark so that
> >> it can be used with other flow engines.
> >> Once you do that, it is worth considering a much bigger play rather than
> >> another incremental play.
> >> Given the scale of Hudi, we need to look at Airflow, particularly in the
> >> context of what Google is doing with Composer, addressing autoscaling,
> >> scheduling, monitoring, etc.
> >> You need all of that to manage a serious ETL/ELT flow.
> >>
> >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> >>
> >> > Currently, Hudi has a component that has not been widely used: the
> >> > Transformer. As we all know, before raw data lands in the data lake,
> >> > a very common operation is data preprocessing and ETL. This is also
> >> > one of the most common use scenarios for computing engines such as
> >> > Flink and Spark. Since Hudi already harnesses the power of a
> >> > computing engine, it can just as naturally leverage that engine's
> >> > data preprocessing capabilities. We can refactor the Transformer to
> >> > make it more flexible. To summarize, we can refactor it along the
> >> > following lines:
> >> >
> >> > - Decouple the Transformer from Spark
> >> > - Enrich the Transformer and provide built-in transformers
> >> > - Support Transformer chains
> >> >
> >> > For the first point: the Transformer interface is tightly coupled to
> >> > Spark by design and contains a Spark-specific context. This makes it
> >> > impossible for us to take advantage of the transform capabilities
> >> > provided by other engines (such as Flink) once multiple engines are
> >> > supported. Therefore, we need to decouple it from Spark at the
> >> > design level.
> >> >
> >> > For the second point: we can enhance the Transformer and provide
> >> > some out-of-the-box Transformers, such as FilterTransformer,
> >> > FlatMapTransformer, and so on.
> >> >
> >> > For the third point: the most common pattern for data processing is
> >> > the pipeline model, and a common implementation of the pipeline
> >> > model is the chain-of-responsibility pattern (compare Apache Commons
> >> > Chain[1]). Combining multiple Transformers would make data
> >> > processing more flexible and extensible.
> >> >
> >> > If we enhance the capabilities of the Transformer component, Hudi
> >> > will provide richer data processing capabilities on top of the
> >> > computing engine.
> >> >
> >> > What do you think?
> >> >
> >> > Any opinions and feedback are welcome and appreciated.
> >> >
> >> > Best,
> >> > Vino
> >> >
> >> > [1]: https://commons.apache.org/proper/commons-chain/
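
On the first point of the proposal above: here is a minimal strawman of
what a Spark-decoupled Transformer interface could look like. The names
here (EngineContext, the type parameter T) are placeholders invented for
illustration, not existing Hudi classes.

import java.util.Properties;

// Hypothetical engine-neutral context; concrete engines (Spark, Flink)
// would provide a subclass carrying the engine-specific session.
abstract class EngineContext {
}

// Engine-agnostic Transformer sketch: the type parameter T stands in
// for the engine-specific dataset type (e.g. Spark's Dataset<Row> or a
// Flink DataStream), so no Spark class appears in the interface itself.
interface Transformer<T> {

  // Transforms one dataset into another; props carries per-transformer
  // configuration, much like the properties object the current
  // Spark-coupled interface receives.
  T apply(EngineContext context, T dataset, Properties props);
}

A Spark module would then implement Transformer<Dataset<Row>> and a
Flink module Transformer<DataStream<Row>>, without either engine leaking
into the common interface.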
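
On the second point: built-in transformers could then be written once
against that interface. Purely to keep the example self-contained, the
sketch below filters a plain java.util.List; a real implementation would
delegate to the engine's own filter operator.

import java.util.List;
import java.util.Properties;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative FilterTransformer: keeps only the records matching a
// predicate. Written against List<R> here so the sketch runs without
// any engine on the classpath.
class FilterTransformer<R> implements Transformer<List<R>> {

  private final Predicate<R> predicate;

  FilterTransformer(Predicate<R> predicate) {
    this.predicate = predicate;
  }

  @Override
  public List<R> apply(EngineContext context, List<R> dataset, Properties props) {
    return dataset.stream()
        .filter(predicate)
        .collect(Collectors.toList());
  }
}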
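
And on the third point: a responsibility-chain composition in the spirit
of Apache Commons Chain might look like the sketch below, where each
transformer's output becomes the next one's input.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Sketch of a Transformer chain: applies each transformer in order,
// piping the output of one into the next. The chain is itself a
// Transformer, so chains can be nested and composed.
class ChainedTransformer<T> implements Transformer<T> {

  private final List<Transformer<T>> transformers;

  @SafeVarargs
  ChainedTransformer(Transformer<T>... transformers) {
    this.transformers = Arrays.asList(transformers);
  }

  @Override
  public T apply(EngineContext context, T dataset, Properties props) {
    T current = dataset;
    for (Transformer<T> transformer : transformers) {
      current = transformer.apply(context, current, props);
    }
    return current;
  }
}

A pipeline could then be assembled roughly as
new ChainedTransformer<>(new FilterTransformer<String>(Objects::nonNull),
otherTransformer), where otherTransformer is any other
Transformer<List<String>>.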
