Thanks for kicking this discussion off...

At a high level, improving the DeltaStreamer tool to better support ETL
pipelines is a great goal, and we can do a lot more here to help.

> Currently, Hudi has a component that has not been widely used:
> Transformer.
Not true, actually. The recent DMS integration was built on it, and I know
of at least two other users. At the end of the day, it's a very simple
interface that takes a DataFrame in and hands a DataFrame out, and it need
not be any more complicated than that. On Flink, we first need to get a
custom/handwritten pipeline working (a la hudi-spark) before we can extend
DeltaStreamer to Flink, which is a much larger effort.
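
For reference, the interface today is roughly the following (paraphrased
from org.apache.hudi.utilities.transform.Transformer; the exact package and
signature may differ slightly across versions, so check the source):

import org.apache.hudi.common.util.TypedProperties;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Rough shape of DeltaStreamer's Transformer: a DataFrame comes in,
// a transformed DataFrame goes out.
public interface Transformer {
  Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                     Dataset<Row> rowDataset, TypedProperties properties);
}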

How about we first focus the discussion on how we can expand ETL support,
rather than zooming in on this interface (which is a lower-value
conversation, IMO)?
Here are some thoughts:

- (vinoyang's point) We could build a set of standard transformations into
Hudi itself: common timestamp extraction, field masking/filtering, and the
like. (A small sketch of what one could look like follows this list.)
- (hamid's point) Airflow is a workflow scheduler, and people can use it to
schedule DeltaStreamer/Spark jobs. IMO it's orthogonal/complementary to the
transforms we would support ourselves. But I think we could provide real
value by implementing a way to trigger data pipelines in Airflow whenever a
Hudi dataset receives new commits, e.g. running an incremental ETL every
time a new commit lands on the source Hudi table. (See the second sketch
after this list.)
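
To make the first bullet concrete, here is a minimal sketch of what a
built-in field-masking transformer could look like, assuming the Transformer
interface sketched above; the class name and config key below are made up
purely for illustration:

import org.apache.hudi.common.util.TypedProperties;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

// Hypothetical built-in transformer: replaces one configured column with a
// constant mask value. The config key is illustrative only.
public class FieldMaskingTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    String maskedField = properties.getString("hoodie.deltastreamer.transformer.mask.field");
    return rowDataset.withColumn(maskedField, functions.lit("****"));
  }
}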
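
On the second bullet, the piece Hudi could own is the "has the source table
seen new commits since instant X" check; an Airflow sensor would simply poll
it. A rough sketch of that check (class/method names are from memory via
HoodieDataSourceHelpers, so please verify against the Hudi version you run):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hudi.HoodieDataSourceHelpers;

// Sketch of the check an external scheduler could poll: has the source Hudi
// table completed any commit after `lastSeenInstant`?
public class CommitWatcher {
  public static boolean hasNewCommits(String basePath, String lastSeenInstant) throws Exception {
    FileSystem fs = FileSystem.get(new Path(basePath).toUri(), new Configuration());
    return HoodieDataSourceHelpers.hasNewCommits(fs, basePath, lastSeenInstant);
  }
}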

Thanks
Vinoth





On Fri, Feb 7, 2020 at 6:11 PM vino yang <yanghua1...@gmail.com> wrote:

> Hi hamid,
>
> AFAIK, Transformer currently runs as a task (a Spark task) in the context
> of Hudi reading data; it is not a standalone component outside of Hudi.
> Can you describe in more detail how Apache Airflow would be used?
> I would suggest one premise here: our goal is to enhance the data
> preprocessing capabilities of Hudi.
>
> Best,
> Vino
>
>
> On Sat, Feb 8, 2020 at 4:23 AM hamid pirahesh <hpirah...@gmail.com> wrote:
>
> > What about using Apache Airflow for creating a DAG of
> > transformer operators?
> >
> > On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
> >
> > > Currently, Hudi has a component that has not been widely used:
> > > Transformer. As we all know, a very common step before raw data lands
> > > in the data lake is preprocessing and ETL. This is also among the most
> > > common use cases of computing engines such as Flink and Spark. Now that
> > > Hudi leverages the power of a computing engine, it can naturally
> > > leverage that engine's data preprocessing abilities as well. We can
> > > refactor the Transformer to make it more flexible. To summarize, we can
> > > refactor along the following lines:
> > >
> > >    - Decouple Transformer from Spark
> > >    - Enrich the Transformer and provide built-in transformers
> > >    - Support Transformer chains
> > >
> > > For the first point, the Transformer interface is tightly coupled with
> > > Spark by design, and it contains a Spark-specific context. This makes
> > > it impossible for us to take advantage of the transform capabilities
> > > provided by other engines (such as Flink) once multiple engines are
> > > supported. Therefore, we need to decouple it from Spark in the design.
> > >
> > > For the second point, we can enhance the Transformer and provide some
> > > out-of-the-box Transformers, such as FilterTransformer,
> > > FlatMapTransformer, and so on.
> > >
> > > For the third point, the most common pattern for data processing is the
> > > pipeline model, and a common implementation of the pipeline model is
> > > the chain-of-responsibility pattern (compare Apache Commons Chain [1]);
> > > combining multiple Transformers makes data processing more flexible and
> > > extensible.
> > >
> > > If we enhance the capabilities of the Transformer component, Hudi will
> > > provide richer data processing capabilities on top of the computing
> > > engine.
> > >
> > > What do you think?
> > >
> > > Any opinions and feedback are welcome and appreciated.
> > >
> > > Best,
> > > Vino
> > >
> > > [1]: https://commons.apache.org/proper/commons-chain/
> > >
> >
>
