Yes, I am familiar with those systems (I actually named Marmaray :)). I am not opposed to building a set of built-in transformers for common things, per se. It could actually help adoption of the DeltaStreamer as well.
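For concreteness, a couple of the "common things" discussed in this thread (timestamp extraction, field masking) could look roughly like the following. This is only an illustrative Python sketch that models rows as plain dicts; Hudi's actual Transformer operates on Spark Datasets, and the names `mask_fields` and `extract_timestamp` are hypothetical, not existing Hudi APIs.

```python
from datetime import datetime, timezone

def mask_fields(rows, fields, mask="***"):
    """Hypothetical built-in transform: replace sensitive fields with a mask."""
    return [{k: (mask if k in fields else v) for k, v in row.items()}
            for row in rows]

def extract_timestamp(rows, source_field, target_field="event_ts"):
    """Hypothetical built-in transform: parse an epoch-seconds field into ISO-8601."""
    out = []
    for row in rows:
        row = dict(row)
        row[target_field] = datetime.fromtimestamp(
            row[source_field], tz=timezone.utc
        ).isoformat()
        out.append(row)
    return out

# Example input and a two-step "pipeline" by plain function composition.
records = [{"user": "alice", "ssn": "123-45-6789", "ts": 1581379620}]
masked = mask_fields(records, fields={"ssn"})
stamped = extract_timestamp(masked, source_field="ts")
```

Transforms of this shape compose trivially, which is what makes shipping a small standard library of them attractive.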
On Mon, Feb 10, 2020 at 6:27 PM vino yang <[email protected]> wrote:

> Hi Vinoth,
>
> Thanks for summarizing both our opinions. Your summary is good.
>
> >> How about we first focus our discussion on how we can expand ETL
> >> support, rather than zooming in on this interface (which is a
> >> lower-value conversation IMO).
>
> Of course, yes.
>
> Regarding your summary: the initial goal of this discussion was to talk
> about improving the Transformer component so that Hudi itself can provide
> more powerful ETL capabilities. If it involves third-party frameworks
> (such as a scheduling engine), then the direction shifts to an "ecosystem"
> perspective, namely how to integrate with a third-party framework to
> better enable Hudi to support strong ETL capabilities. Of course, both
> directions aim to develop Hudi in a good way.
>
> Recently, I saw a data ingestion framework named Marmaray[1], open-sourced
> by Uber (maybe you are familiar with this project). Hudi's Transformer is
> similar to the converter component it provides, so I started this proposal
> to see whether we can enhance the Transformer so that Hudi can have strong
> ETL capabilities without relying on any other services.
>
> Of course, I absolutely agree with your second point. It is also necessary
> for us to make it convenient and flexible for third parties to conduct ETL.
>
> Best,
> Vino
>
> [1]: https://github.com/uber/marmaray
>
>
> Vinoth Chandar <[email protected]> wrote on Mon, Feb 10, 2020 at 4:10 PM:
>
> > Thanks for kicking this discussion off...
> >
> > At a high level, improving the DeltaStreamer tool to better support ETL
> > pipelines is a great goal, and we can do a lot more here to help.
> >
> > > Currently, Hudi has a component that has not been widely used:
> > > Transformer.
> >
> > Not true, actually. The recent DMS integration was based off this, and I
> > know of at least two other users.
> > At the end of the day, it's a very simple interface that takes a
> > DataFrame in and hands a DataFrame out, and it need not be any more
> > complicated than that. On Flink, we can first get the custom/handwritten
> > pipeline working (a la hudi-spark) before we extend the DeltaStreamer to
> > Flink. That is a much larger effort.
> >
> > How about we first focus our discussion on how we can expand ETL
> > support, rather than zooming in on this interface (which is a
> > lower-value conversation IMO)? Here are some thoughts:
> >
> > - (vinoyang's point) We could support a lot of standard transformations
> > built into Hudi itself. These can include common timestamp extraction,
> > field masking/filtering, and such.
> > - (hamid's point) Airflow is a workflow scheduler, and people can use it
> > to schedule DeltaStreamer/Spark jobs. IMO it is orthogonal/complementary
> > to the transforms we would support ourselves. But I think we could
> > provide some real value by implementing a way to trigger data pipelines
> > in Airflow when a Hudi dataset receives new commits; e.g., we could run
> > an incremental ETL every time a new commit lands on the source Hudi
> > table.
> >
> > Thanks
> > Vinoth
> >
> >
> > On Fri, Feb 7, 2020 at 6:11 PM vino yang <[email protected]> wrote:
> >
> > > Hi hamid,
> > >
> > > AFAIK, currently, the Transformer runs as a (Spark) task in the
> > > context of Hudi reading data; it is not a standalone component outside
> > > of Hudi. Can you describe in more detail how Apache Airflow would be
> > > used? I personally suggest we keep one premise here: our goal is to
> > > enhance the data preprocessing capabilities of Hudi.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > > hamid pirahesh <[email protected]> wrote on Sat, Feb 8, 2020 at
> > > 4:23 AM:
> > >
> > > > What about using Apache Airflow for creating a DAG of
> > > > transformer operators?
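The "trigger on new commits" idea above can be made concrete with a small polling sketch. It assumes only that a Hudi table keeps its timeline under `<table>/.hoodie`, with one `<instant>.commit` file per completed commit (true for copy-on-write tables; merge-on-read tables also produce `.deltacommit` files). The function names are hypothetical; an Airflow sensor's `poke()` method could simply wrap `new_commits_since` and kick off a downstream incremental job when it returns a non-empty list.

```python
import os

def completed_commits(table_path):
    """List completed commit instants from the Hudi timeline, sorted.

    Only files ending in `.commit` are counted, so requested/inflight
    instants (e.g. `<instant>.commit.inflight`) are ignored.
    """
    timeline = os.path.join(table_path, ".hoodie")
    return sorted(
        f.split(".")[0] for f in os.listdir(timeline) if f.endswith(".commit")
    )

def new_commits_since(table_path, last_seen):
    """Commits strictly newer than `last_seen`.

    Hudi instant times are timestamp strings, so lexicographic comparison
    matches chronological order.
    """
    return [c for c in completed_commits(table_path) if c > last_seen]
```

A scheduler-side caller would persist the last instant it acted on, poll with `new_commits_since`, and use the newest instant as the checkpoint for the next incremental pull.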
> > > >
> > > > On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> > > >
> > > > > Currently, Hudi has a component that has not been widely used: the
> > > > > Transformer. As we all know, before raw data lands in the data
> > > > > lake, a very common operation is data preprocessing and ETL. This
> > > > > is also the most common use scenario for many computing engines,
> > > > > such as Flink and Spark. Now that Hudi has taken advantage of the
> > > > > power of the computing engine, it can also naturally take
> > > > > advantage of its data preprocessing ability. We can refactor the
> > > > > Transformer to make it more flexible. To summarize, we can
> > > > > refactor along the following aspects:
> > > > >
> > > > > - Decouple the Transformer from Spark
> > > > > - Enrich the Transformer and provide built-in transformers
> > > > > - Support a Transformer chain
> > > > >
> > > > > For the first point, the Transformer interface is tightly coupled
> > > > > with Spark in its design, and it contains a Spark-specific
> > > > > context. This makes it impossible for us to take advantage of the
> > > > > transform capabilities provided by other engines (such as Flink)
> > > > > once we support multiple engines. Therefore, we need to decouple
> > > > > it from Spark in the design.
> > > > >
> > > > > For the second point, we can enhance the Transformer and provide
> > > > > some out-of-the-box transformers, such as FilterTransformer,
> > > > > FlatMapTransformer, and so on.
> > > > >
> > > > > For the third point, the most common pattern for data processing
> > > > > is the pipeline model, and a common implementation of the pipeline
> > > > > model is the chain-of-responsibility pattern (compare Apache
> > > > > Commons Chain[1]). Combining multiple Transformers can make data
> > > > > processing more flexible and extensible.
> > > > >
> > > > > If we enhance the capabilities of the Transformer component, Hudi
> > > > > will provide richer data processing capabilities on top of the
> > > > > computing engine.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Any opinions and feedback are welcome and appreciated.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > [1]: https://commons.apache.org/proper/commons-chain/
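One possible shape for the three refactoring points in the proposal (engine decoupling, built-in transformers, a transformer chain) can be sketched in plain Python. This is a minimal sketch, not Hudi's actual API: rows are modeled as dicts instead of an engine-specific Dataset, and `TransformerChain` is a hypothetical name for the chain-of-responsibility wrapper the proposal describes.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List

Row = dict  # stand-in for an engine-specific row/record type


class Transformer(ABC):
    """Engine-agnostic contract: rows in, rows out, nothing Spark-specific."""

    @abstractmethod
    def apply(self, rows: List[Row]) -> List[Row]: ...


class FilterTransformer(Transformer):
    """Built-in example: keep rows matching a predicate."""

    def __init__(self, predicate: Callable[[Row], bool]):
        self.predicate = predicate

    def apply(self, rows):
        return [r for r in rows if self.predicate(r)]


class FlatMapTransformer(Transformer):
    """Built-in example: expand each row into zero or more rows."""

    def __init__(self, fn: Callable[[Row], Iterable[Row]]):
        self.fn = fn

    def apply(self, rows):
        return [out for r in rows for out in self.fn(r)]


class TransformerChain(Transformer):
    """Chain of responsibility: each stage's output feeds the next stage."""

    def __init__(self, *stages: Transformer):
        self.stages = stages

    def apply(self, rows):
        for stage in self.stages:
            rows = stage.apply(rows)
        return rows


# A two-stage pipeline: drop non-positive amounts, then duplicate each row.
chain = TransformerChain(
    FilterTransformer(lambda r: r["amount"] > 0),
    FlatMapTransformer(lambda r: [r, {**r, "mirror": True}]),
)
result = chain.apply([{"amount": 5}, {"amount": -1}])
```

Because `TransformerChain` itself implements `Transformer`, chains nest freely, which is the flexibility the proposal is after; a real implementation would substitute the engine's Dataset/DataStream types for the `Row` list.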
