Hi Vinoth,

Thanks for summarizing both of our opinions. Your summary is good.
>> How about we first focus our discussion on how we can expand ETL
>> support, rather than zooming in on this interface (which is a
>> lower-value conversation IMO).

Of course, yes. Regarding your summary: the initial goal of this
discussion was to talk about improving the Transformer component so that
Hudi itself can provide more powerful ETL capabilities. If it involves
third-party frameworks (such as a scheduling engine), then the direction
shifts to an "ecosystem" perspective, namely how to integrate with a
third-party framework so that Hudi can better support strong ETL. Of
course, both directions are worth pursuing.

Recently, I came across marmaray[1], a data ingestion framework
open-sourced by Uber (you may be familiar with this project). Hudi's
Transformer is similar to the converter component it provides, so I
started this proposal to see whether we can enhance the Transformer so
that Hudi has strong ETL capabilities without relying on any other
services. To make this concrete, two rough sketches (the current
interface, and what an engine-agnostic, chainable shape could look like)
are appended at the end of this mail.

I also absolutely agree with your second point: it is likewise necessary
for us to give third parties the convenience and flexibility to run ETL
against Hudi.

Best,
Vino

[1]: https://github.com/uber/marmaray

On Mon, Feb 10, 2020 at 4:10 PM Vinoth Chandar <[email protected]> wrote:

> Thanks for kicking this discussion off...
>
> At a high level, improving the deltastreamer tool to be able to better
> support ETL pipelines is a great goal and we can do a lot more here to
> help.
>
> > Currently, Hudi has a component that has not been widely used:
> > Transformer.
>
> Not true actually. The recent DMS integration was based off this and I
> know of at least two other users. At the end of the day, it's a very
> simple interface that takes a dataframe in and hands a dataframe out,
> and it need not be any more complicated than that. On Flink, we can
> first get the custom/handwritten pipeline working (a la hudi-spark)
> before we can extend deltastreamer to Flink. That is a much larger
> effort.
>
> How about we first focus our discussion on how we can expand ETL
> support, rather than zooming in on this interface (which is a
> lower-value conversation IMO). Here are some thoughts:
>
> - (vinoyang's point) We could support a lot of standard transformations
> built into Hudi itself? This can include common timestamp extraction,
> field masking/filtering and such.
> - (hamid's point) Airflow is a workflow scheduler and people can use it
> to schedule DeltaStreamer jobs/Spark jobs. IMO it's
> orthogonal/complementary to the transforms we would support ourselves.
> But I think we can provide some real value by implementing some way to
> trigger data pipelines in Airflow based on a Hudi dataset receiving new
> commits, e.g. we could run an incremental ETL every time a new commit
> lands on the source Hudi table?
>
> Thanks
> Vinoth
>
> On Fri, Feb 7, 2020 at 6:11 PM vino yang <[email protected]> wrote:
>
> > Hi hamid,
> >
> > AFAIK, the Transformer currently works as a task (a Spark task) in
> > the context of Hudi reading data.
> > It is not a standalone component outside of Hudi.
> > Can you describe in more detail how you would use Apache Airflow?
> > I personally suggest that we keep one premise here: our goal is to
> > enhance the data preprocessing capabilities of Hudi.
> >
> > Best,
> > Vino
> >
> > On Sat, Feb 8, 2020 at 4:23 AM hamid pirahesh <[email protected]>
> > wrote:
> >
> > > What about using Apache Airflow for creating a DAG of
> > > transformer operators?
> > >
> > > On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> > >
> > > > Currently, Hudi has a component that has not been widely used: the
> > > > Transformer.
> > > > As we all know, before raw data lands in the data lake, a very
> > > > common operation is data preprocessing and ETL. This is also the
> > > > most common use scenario for many computing engines, such as Flink
> > > > and Spark. Now that Hudi has taken advantage of the power of the
> > > > computing engine, it can also naturally take advantage of the
> > > > engine's data preprocessing ability. We can refactor the
> > > > Transformer to make it more flexible. To summarize, we can
> > > > refactor it from the following aspects:
> > > >
> > > > - Decouple the Transformer from Spark
> > > > - Enrich the Transformer and provide built-in transformers
> > > > - Support Transformer chains
> > > >
> > > > For the first point, the Transformer interface is tightly coupled
> > > > with Spark in its design, and it contains a Spark-specific
> > > > context. This makes it impossible for us to take advantage of the
> > > > transform capabilities provided by other engines (such as Flink)
> > > > once we support multiple engines. Therefore, we need to decouple
> > > > it from Spark in the design.
> > > >
> > > > For the second point, we can enhance the Transformer and provide
> > > > some out-of-the-box Transformers, such as FilterTransformer,
> > > > FlatMapTransformer, and so on.
> > > >
> > > > For the third point, the most common pattern for data processing
> > > > is the pipeline model, and a common implementation of the pipeline
> > > > model is the chain-of-responsibility pattern, comparable to Apache
> > > > Commons Chain[1]. Combining multiple Transformers can make data
> > > > processing more flexible and extensible.
> > > >
> > > > If we enhance the capabilities of the Transformer component, Hudi
> > > > will provide richer data processing capabilities on top of the
> > > > computing engine.
> > > >
> > > > What do you think?
> > > >
> > > > Any opinions and feedback are welcome and appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > [1]: https://commons.apache.org/proper/commons-chain/
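Sketch 1: the current, Spark-coupled interface, plus a trivial custom
implementation. This is written from memory purely as an illustration;
the exact package names, the TypedProperties type Hudi actually passes,
and the property key used here are assumptions and may differ across
Hudi versions.

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Properties;

// Roughly the shape of org.apache.hudi.utilities.transform.Transformer today:
// a dataframe goes in, a dataframe comes out. (Hudi actually passes its own
// TypedProperties; plain java.util.Properties is used here to keep the
// sketch self-contained.)
interface Transformer {
  Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                     Dataset<Row> rowDataset, Properties properties);
}

// A trivial custom implementation: keep only the rows matching a SQL
// condition supplied via a hypothetical property key.
class SqlConditionFilterTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, Properties properties) {
    // e.g. "hoodie.deltastreamer.transformer.filter.condition=amount > 0"
    // (this key name is made up for the example)
    String condition = properties.getProperty(
        "hoodie.deltastreamer.transformer.filter.condition", "true");
    return rowDataset.filter(condition);
  }
}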

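Sketch 2: what the proposed refactor could look like, i.e. an
engine-agnostic Transformer plus chain-style composition in the spirit
of Apache Commons Chain. Every name below is hypothetical; none of these
classes exist in Hudi today, they only illustrate the idea.

import java.util.List;
import java.util.Properties;

// Hypothetical engine-agnostic transformer: DATASET is whatever the engine
// hands us (Dataset<Row> on Spark, a DataStream on Flink, ...), so no
// Spark-specific context leaks into the interface.
interface EngineAgnosticTransformer<DATASET> {
  DATASET apply(DATASET input, Properties props);
}

// Hypothetical chain-of-responsibility composition: each transformer's
// output feeds the next transformer's input, so a pipeline is just an
// ordered list of small steps.
class ChainedTransformer<DATASET> implements EngineAgnosticTransformer<DATASET> {

  private final List<EngineAgnosticTransformer<DATASET>> chain;

  ChainedTransformer(List<EngineAgnosticTransformer<DATASET>> chain) {
    this.chain = chain;
  }

  @Override
  public DATASET apply(DATASET input, Properties props) {
    DATASET current = input;
    for (EngineAgnosticTransformer<DATASET> step : chain) {
      current = step.apply(current, props);
    }
    return current;
  }
}

// Usage sketch on the Spark side, with hypothetical built-in transformers:
//   EngineAgnosticTransformer<Dataset<Row>> pipeline = new ChainedTransformer<>(
//       Arrays.asList(new FilterTransformer(), new TimestampExtractorTransformer()));
//   Dataset<Row> transformed = pipeline.apply(source, props);

With a shape like this, the same ChainedTransformer could be reused
unchanged on Flink once a Flink-specific set of built-in transformers
exists, which is essentially what the first and third bullets of the
proposal ask for.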