Yes, I am familiar with those systems (I actually named Marmaray :)). I am not opposed to building a set of built-in transformers for common things, per se. It could actually help adoption of the DeltaStreamer as well.
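For concreteness, a couple of the "common things" discussed in this thread (timestamp extraction, field masking) could look roughly like the following. This is only an illustrative Python sketch that models rows as plain dicts; Hudi's actual Transformer operates on Spark Datasets, and the names `mask_fields` and `extract_timestamp` are hypothetical, not existing Hudi APIs.

```python
from datetime import datetime, timezone

def mask_fields(rows, fields, mask="***"):
    """Hypothetical built-in transform: replace sensitive fields with a mask."""
    return [{k: (mask if k in fields else v) for k, v in row.items()}
            for row in rows]

def extract_timestamp(rows, source_field, target_field="event_ts"):
    """Hypothetical built-in transform: parse an epoch-seconds field into ISO-8601."""
    out = []
    for row in rows:
        row = dict(row)
        row[target_field] = datetime.fromtimestamp(
            row[source_field], tz=timezone.utc
        ).isoformat()
        out.append(row)
    return out

# Example input and a two-step "pipeline" by plain function composition.
records = [{"user": "alice", "ssn": "123-45-6789", "ts": 1581379620}]
masked = mask_fields(records, fields={"ssn"})
stamped = extract_timestamp(masked, source_field="ts")
```

Transforms of this shape compose trivially, which is what makes shipping a small standard library of them attractive.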
On Mon, Feb 10, 2020 at 6:27 PM vino yang <[email protected]> wrote:

> Hi Vinoth,
>
> Thanks for summarizing both our opinions. Your summary is good.
>
> >> How about we first focus our discussion on how we can expand ETL
> >> support, rather than zooming in on this interface (which is a
> >> lower-value conversation IMO).
>
> Of course, yes.
>
> Regarding your summary: the initial goal of this discussion was to talk
> about improving the Transformer component so that Hudi itself can provide
> more powerful ETL capabilities. If it involves third-party frameworks
> (such as a scheduling engine), then the direction shifts to an "ecosystem"
> perspective, namely how to integrate with a third-party framework to
> better enable Hudi to support strong ETL capabilities. Of course, both
> directions aim to develop Hudi in a good way.
>
> Recently, I saw a data ingestion framework named Marmaray[1], open-sourced
> by Uber (maybe you are familiar with this project). Hudi's Transformer is
> similar to the converter component it provides, so I started this proposal
> to see whether we can enhance the Transformer so that Hudi can have strong
> ETL capabilities without relying on any other services.
>
> Of course, I absolutely agree with your second point. It is also necessary
> for us to make it convenient and flexible for third parties to conduct ETL.
>
> Best,
> Vino
>
> [1]: https://github.com/uber/marmaray
>
>
> Vinoth Chandar <[email protected]> wrote on Mon, Feb 10, 2020 at 4:10 PM:
>
> > Thanks for kicking this discussion off...
> >
> > At a high level, improving the DeltaStreamer tool to better support ETL
> > pipelines is a great goal, and we can do a lot more here to help.
> >
> > > Currently, Hudi has a component that has not been widely used:
> > > Transformer.
> >
> > Not true, actually. The recent DMS integration was based off this, and I
> > know of at least two other users.
> > At the end of the day, it's a very simple interface that takes a
> > DataFrame in and hands a DataFrame out, and it need not be any more
> > complicated than that. On Flink, we can first get the custom/handwritten
> > pipeline working (a la hudi-spark) before we extend the DeltaStreamer to
> > Flink. That is a much larger effort.
> >
> > How about we first focus our discussion on how we can expand ETL
> > support, rather than zooming in on this interface (which is a
> > lower-value conversation IMO)? Here are some thoughts:
> >
> > - (vinoyang's point) We could support a lot of standard transformations
> > built into Hudi itself. These can include common timestamp extraction,
> > field masking/filtering, and such.
> > - (hamid's point) Airflow is a workflow scheduler, and people can use it
> > to schedule DeltaStreamer/Spark jobs. IMO it is orthogonal/complementary
> > to the transforms we would support ourselves. But I think we could
> > provide some real value by implementing a way to trigger data pipelines
> > in Airflow when a Hudi dataset receives new commits; e.g., we could run
> > an incremental ETL every time a new commit lands on the source Hudi
> > table.
> >
> > Thanks
> > Vinoth
> >
> >
> > On Fri, Feb 7, 2020 at 6:11 PM vino yang <[email protected]> wrote:
> >
> > > Hi hamid,
> > >
> > > AFAIK, currently, the Transformer runs as a (Spark) task in the
> > > context of Hudi reading data; it is not a standalone component outside
> > > of Hudi. Can you describe in more detail how Apache Airflow would be
> > > used? I personally suggest we keep one premise here: our goal is to
> > > enhance the data preprocessing capabilities of Hudi.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > > hamid pirahesh <[email protected]> wrote on Sat, Feb 8, 2020 at
> > > 4:23 AM:
> > >
> > > > What about using Apache Airflow for creating a DAG of
> > > > transformer operators?
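The "trigger on new commits" idea above can be made concrete with a small polling sketch. It assumes only that a Hudi table keeps its timeline under `<table>/.hoodie`, with one `<instant>.commit` file per completed commit (true for copy-on-write tables; merge-on-read tables also produce `.deltacommit` files). The function names are hypothetical; an Airflow sensor's `poke()` method could simply wrap `new_commits_since` and kick off a downstream incremental job when it returns a non-empty list.

```python
import os

def completed_commits(table_path):
    """List completed commit instants from the Hudi timeline, sorted.

    Only files ending in `.commit` are counted, so requested/inflight
    instants (e.g. `<instant>.commit.inflight`) are ignored.
    """
    timeline = os.path.join(table_path, ".hoodie")
    return sorted(
        f.split(".")[0] for f in os.listdir(timeline) if f.endswith(".commit")
    )

def new_commits_since(table_path, last_seen):
    """Commits strictly newer than `last_seen`.

    Hudi instant times are timestamp strings, so lexicographic comparison
    matches chronological order.
    """
    return [c for c in completed_commits(table_path) if c > last_seen]
```

A scheduler-side caller would persist the last instant it acted on, poll with `new_commits_since`, and use the newest instant as the checkpoint for the next incremental pull.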
> > > >
> > > > On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> > > >
> > > > > Currently, Hudi has a component that has not been widely used: the
> > > > > Transformer. As we all know, before raw data lands in the data
> > > > > lake, a very common operation is data preprocessing and ETL. This
> > > > > is also the most common use scenario for many computing engines,
> > > > > such as Flink and Spark. Now that Hudi has taken advantage of the
> > > > > power of the computing engine, it can also naturally take
> > > > > advantage of its data preprocessing ability. We can refactor the
> > > > > Transformer to make it more flexible. To summarize, we can
> > > > > refactor along the following aspects:
> > > > >
> > > > > - Decouple the Transformer from Spark
> > > > > - Enrich the Transformer and provide built-in transformers
> > > > > - Support a Transformer chain
> > > > >
> > > > > For the first point, the Transformer interface is tightly coupled
> > > > > with Spark in its design, and it contains a Spark-specific
> > > > > context. This makes it impossible for us to take advantage of the
> > > > > transform capabilities provided by other engines (such as Flink)
> > > > > once we support multiple engines. Therefore, we need to decouple
> > > > > it from Spark in the design.
> > > > >
> > > > > For the second point, we can enhance the Transformer and provide
> > > > > some out-of-the-box transformers, such as FilterTransformer,
> > > > > FlatMapTransformer, and so on.
> > > > >
> > > > > For the third point, the most common pattern for data processing
> > > > > is the pipeline model, and a common implementation of the pipeline
> > > > > model is the chain-of-responsibility pattern (compare Apache
> > > > > Commons Chain[1]). Combining multiple Transformers can make data
> > > > > processing more flexible and extensible.
> > > > >
> > > > > If we enhance the capabilities of the Transformer component, Hudi
> > > > > will provide richer data processing capabilities on top of the
> > > > > computing engine.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Any opinions and feedback are welcome and appreciated.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > [1]: https://commons.apache.org/proper/commons-chain/
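One possible shape for the three refactoring points in the proposal (engine decoupling, built-in transformers, a transformer chain) can be sketched in plain Python. This is a minimal sketch, not Hudi's actual API: rows are modeled as dicts instead of an engine-specific Dataset, and `TransformerChain` is a hypothetical name for the chain-of-responsibility wrapper the proposal describes.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List

Row = dict  # stand-in for an engine-specific row/record type


class Transformer(ABC):
    """Engine-agnostic contract: rows in, rows out, nothing Spark-specific."""

    @abstractmethod
    def apply(self, rows: List[Row]) -> List[Row]: ...


class FilterTransformer(Transformer):
    """Built-in example: keep rows matching a predicate."""

    def __init__(self, predicate: Callable[[Row], bool]):
        self.predicate = predicate

    def apply(self, rows):
        return [r for r in rows if self.predicate(r)]


class FlatMapTransformer(Transformer):
    """Built-in example: expand each row into zero or more rows."""

    def __init__(self, fn: Callable[[Row], Iterable[Row]]):
        self.fn = fn

    def apply(self, rows):
        return [out for r in rows for out in self.fn(r)]


class TransformerChain(Transformer):
    """Chain of responsibility: each stage's output feeds the next stage."""

    def __init__(self, *stages: Transformer):
        self.stages = stages

    def apply(self, rows):
        for stage in self.stages:
            rows = stage.apply(rows)
        return rows


# A two-stage pipeline: drop non-positive amounts, then duplicate each row.
chain = TransformerChain(
    FilterTransformer(lambda r: r["amount"] > 0),
    FlatMapTransformer(lambda r: [r, {**r, "mirror": True}]),
)
result = chain.apply([{"amount": 5}, {"amount": -1}])
```

Because `TransformerChain` itself implements `Transformer`, chains nest freely, which is the flexibility the proposal is after; a real implementation would substitute the engine's Dataset/DataStream types for the `Row` list.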
