Late to the party. :P

I really favor the idea of enriching the built-in transformer support. It
is a very common case that we want to derive the partition path from a
datetime field, so we could have built-in support to normalize ISO-format
timestamps and unix timestamps. For example, an `HourlyPartitionTransformer`
would normalize whatever field the user specifies as the partition path
field. Say the user sets `create_ts` as the partition path field; the
transformer would apply the change create_ts => _hoodie_partition_path:


   - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
   - 1582497702.123456789 => 2020/02/23/22
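
To make the idea concrete, here is a rough sketch of the normalization
logic (only a sketch; the class name and the fallback parsing rules are
my assumptions, not an existing Hudi API):

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;
    import java.time.format.DateTimeParseException;

    public class HourlyPartitionTransformer {

      private static final DateTimeFormatter HOURLY_FORMAT =
          DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

      // Normalizes either an ISO-8601 instant or a unix timestamp
      // (seconds, possibly fractional) into an hourly partition path.
      // Both parsing rules are assumptions for illustration.
      public static String normalize(String rawValue) {
        Instant instant;
        try {
          // e.g. "2020-02-23T22:41:42.123456789Z"
          instant = Instant.parse(rawValue);
        } catch (DateTimeParseException e) {
          // e.g. "1582497702.123456789" -- seconds since the epoch
          instant = Instant.ofEpochSecond((long) Double.parseDouble(rawValue));
        }
        return HOURLY_FORMAT.format(instant); // => "2020/02/23/22"
      }
    }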

Does that make sense? If so, I may file a Jira for this.

As for FilterTransformer or FlatMapTransformer, which are designed for
generic purposes, they seem to belong to Spark or Flink's realm. Once
decoupled from Spark, you'll probably have an abstract Dataset class to
perform engine-agnostic transformations; today, you can already express
both transformations directly with the Spark Dataset API.
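
For example (a rough sketch; the column names "event_type" and "tags"
are invented for illustration):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.explode;

    public class SparkNativeTransforms {
      // What a FilterTransformer / FlatMapTransformer would do, expressed
      // directly against the Spark Dataset API.
      static Dataset<Row> transform(Dataset<Row> input) {
        return input
            .filter(col("event_type").equalTo("click"))   // filter
            .withColumn("tag", explode(col("tags")));     // flatMap-like
      }
    }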

My understanding is that the Transformer in Hudi is more specifically
purposed, with the underlying transformation handled by the actual
processing engine (Spark or Flink).
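
To illustrate what I mean, a decoupled interface could look roughly like
this (purely illustrative; HoodieDataset and HoodieEngineContext are
hypothetical placeholders, not existing Hudi classes):

    import java.util.Properties;

    // Hypothetical engine-agnostic handle over Spark's Dataset<Row>,
    // Flink's DataStream, and so on.
    interface HoodieDataset<T> {}

    // Hypothetical context backed by the actual engine at runtime.
    interface HoodieEngineContext {}

    // A Transformer with no Spark types in its signature; the engine
    // behind the context performs the actual transformation.
    interface EngineAgnosticTransformer {
      <T> HoodieDataset<T> apply(HoodieEngineContext context,
                                 HoodieDataset<T> input,
                                 Properties props);
    }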


On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks Hamid and Vinoyang for the great discussion
>
> On Fri, Feb 14, 2020 at 5:18 AM vino yang <yanghua1...@gmail.com> wrote:
>
> > I have filed a Jira issue[1] to track this work.
> >
> > [1]: https://issues.apache.org/jira/browse/HUDI-613
> >
> > > On Thu, Feb 13, 2020 at 9:51 PM vino yang <yanghua1...@gmail.com> wrote:
> >
> > > Hi hamid,
> > >
> > > Agree with your opinion.
> > >
> > > Let's move forward step by step.
> > >
> > > Will file an issue to track the Transformer refactor.
> > >
> > > Best,
> > > Vino
> > >
> > > On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh <hpirah...@gmail.com> wrote:
> > >
> > >> I think it is a good idea to decouple the transformer from Spark so
> > >> that it can be used with other flow engines.
> > >> Once you do that, it is worth considering a much bigger play rather
> > >> than another incremental play.
> > >> Given the scale of Hudi, we need to look at Airflow, particularly in
> > >> the context of what Google is doing with Composer, addressing
> > >> autoscaling, scheduling, monitoring, etc.
> > >> You need all of that to manage a serious ETL/ELT flow.
> > >>
> > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <yanghua1...@gmail.com> wrote:
> > >>
> > >> > Currently, Hudi has a component that has not been widely used: the
> > >> > Transformer. As we all know, before raw data lands in the data lake,
> > >> > a very common step is data preprocessing and ETL. This is also one of
> > >> > the most common use scenarios for computing engines such as Flink and
> > >> > Spark. Now that Hudi has harnessed the power of the computing engine,
> > >> > it can naturally also take advantage of the engine's preprocessing
> > >> > abilities. We can refactor the Transformer to make it more flexible.
> > >> > To summarize, we can refactor it along the following aspects:
> > >> >
> > >> >    - Decouple Transformer from Spark
> > >> >    - Enrich the Transformer and provide built-in transformers
> > >> >    - Support Transformer chains
> > >> >
> > >> > For the first point, the Transformer interface is tightly coupled
> > >> > with Spark in its design, and it contains a Spark-specific context.
> > >> > This makes it impossible for us to take advantage of the transform
> > >> > capabilities provided by other engines (such as Flink) once we
> > >> > support multiple engines. Therefore, we need to decouple the design
> > >> > from Spark.
> > >> >
> > >> > For the second point, we can enhance the Transformer and provide some
> > >> > out-of-the-box Transformers, such as FilterTransformer,
> > >> > FlatMapTransformer, and so on.
> > >> >
> > >> > For the third point, the most common pattern for data processing is
> > >> > the pipeline model, and a common implementation of the pipeline model
> > >> > is the chain-of-responsibility pattern, comparable to Apache Commons
> > >> > Chain[1]. Combining multiple Transformers can make data processing
> > >> > more flexible and extensible.
> > >> >
> > >> > If we enhance the capabilities of the Transformer component, Hudi
> > >> > will provide richer data processing capabilities on top of the
> > >> > computing engine.
> > >> >
> > >> > What do you think?
> > >> >
> > >> > Any opinions and feedback are welcome and appreciated.
> > >> >
> > >> > Best,
> > >> > Vino
> > >> >
> > >> > [1]: https://commons.apache.org/proper/commons-chain/
> > >> >
> > >>
> > >
> >
>
