Thanks. After reading the discussion in HUDI-561, I realized that the built-in partition transformer mentioned earlier is better suited to a custom key generator. Hopefully other suitable ideas for built-in transformers will come up later.
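To make the custom-key-generator direction concrete, here is a minimal sketch of an hourly key generator that normalizes the `create_ts` field (ISO-8601 string or unix-seconds timestamp) into a `yyyy/MM/dd/HH` partition path, matching the examples quoted below. It assumes the KeyGenerator API of the time (a constructor taking TypedProperties and a single getKey(GenericRecord) returning HoodieKey); the class name, import paths, and config keys are illustrative and may differ across Hudi versions.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.keygen.KeyGenerator;

/**
 * Illustrative sketch only: maps an ISO-8601 string or a unix timestamp
 * (seconds) in the configured partition path field to an hourly
 * yyyy/MM/dd/HH partition path, e.g.
 *   2020-02-23T22:41:42.123456789Z -> 2020/02/23/22
 *   1582497702.123456789           -> 2020/02/23/22
 */
public class HourlyPartitionKeyGenerator extends KeyGenerator {

  private static final DateTimeFormatter HOURLY =
      DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

  private final String recordKeyField;
  private final String partitionPathField; // e.g. "create_ts"

  public HourlyPartitionKeyGenerator(TypedProperties props) {
    super(props);
    // Config key names here are assumptions for illustration.
    this.recordKeyField = props.getString("hoodie.datasource.write.recordkey.field");
    this.partitionPathField = props.getString("hoodie.datasource.write.partitionpath.field");
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    String recordKey = String.valueOf(record.get(recordKeyField));
    String raw = String.valueOf(record.get(partitionPathField));
    // Numeric values are treated as unix seconds; everything else as ISO-8601.
    Instant ts = raw.matches("\\d+(\\.\\d+)?")
        ? Instant.ofEpochSecond((long) Double.parseDouble(raw))
        : Instant.parse(raw);
    return new HoodieKey(recordKey, HOURLY.format(ts));
  }
}
```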
On Sun, Feb 23, 2020 at 6:34 PM vino yang <[email protected]> wrote:

> Hi Shiyan,
>
> Really sorry, I forgot to attach the reference; the relevant Jira ID is HUDI-561: https://issues.apache.org/jira/browse/HUDI-561
>
> It seems both of you faced the same issue, though the solutions are not the same. Never mind, you can move the discussion to that issue.
>
> Best,
> Vino
>
> Shiyan Xu <[email protected]> wrote on Mon, Feb 24, 2020 at 10:21 AM:
>
> > Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella task due to its big scope? (BTW, it is stated as a "bug", which should be fixed too.) I can create another specific task under it for the idea of a datetime -> partition path transformer, if that makes sense.
> >
> > On Sun, Feb 23, 2020 at 5:57 PM vino yang <[email protected]> wrote:
> >
> > > Hi Shiyan,
> > >
> > > Thanks for raising this thread up again and sharing your thoughts. They are valuable.
> > >
> > > Regarding the date-time specific transform, there is an issue[1] that describes this business requirement.
> > >
> > > Best,
> > > Vino
> > >
> > > Shiyan Xu <[email protected]> wrote on Mon, Feb 24, 2020 at 7:22 AM:
> > >
> > > > Late to the party. :P
> > > >
> > > > I really favor the idea of enriching the built-in support. It is a very common case to use datetime fields for the partition path. We could have built-in support to normalize ISO format / unix timestamps. For example, an `HourlyPartitionTransformer` would normalize whatever field the user specified as the partition path. Say the user sets `create_ts` as the partition path field; the transformer would apply the change create_ts => _hoodie_partition_path:
> > > >
> > > > - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > > > - 1582497702.123456789 => 2020/02/23/22
> > > >
> > > > Does that make sense? If so, I may file a Jira for this.
> > > >
> > > > As for FilterTransformer or FlatMapTransformer, which are designed for generic purposes, they seem to belong to Spark or Flink's realm. You can do these two transformations with the Spark Dataset now, or, once decoupled from Spark, you'll probably have an abstract Dataset class to perform engine-agnostic transformations.
> > > >
> > > > My understanding of the transformer in Hudi is that it is more specifically purposed, where the underlying transformation is handled by the actual processing engine (Spark or Flink).
> > > >
> > > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Thanks Hamid and Vinoyang for the great discussion.
> > > > >
> > > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang <[email protected]> wrote:
> > > > >
> > > > > > I have filed a Jira issue[1] to track this work.
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > > >
> > > > > > vino yang <[email protected]> wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > > > >
> > > > > > > Hi hamid,
> > > > > > >
> > > > > > > Agree with your opinion.
> > > > > > >
> > > > > > > Let's move forward step by step.
> > > > > > >
> > > > > > > Will file an issue to track the refactoring of the Transformer.
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > > hamid pirahesh <[email protected]> wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > > > >
> > > > > > >> I think it is a good idea to decouple the transformer from Spark so that it can be used with other flow engines.
> > > > > > >> Once you do that, it is worth considering a much bigger play rather than another incremental play.
> > > > > > >> Given the scale of Hudi, we need to look at Airflow, particularly in the context of what Google is doing with Composer, addressing autoscaling, scheduling, monitoring, etc.
> > > > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > > > >>
> > > > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> > > > > > >>
> > > > > > >> > Currently, Hudi has a component that has not been widely used: Transformer. As we all know, before the original data lands in the data lake, a very common operation is data preprocessing and ETL. This is also the most common use scenario of many computing engines, such as Flink and Spark. Now that Hudi has taken advantage of the power of the computing engine, it can also naturally take advantage of its data preprocessing ability. We can refactor the Transformer to make it more flexible. To summarize, we can refactor along the following aspects:
> > > > > > >> >
> > > > > > >> > - Decouple Transformer from Spark
> > > > > > >> > - Enrich the Transformer and provide built-in transformers
> > > > > > >> > - Support Transformer chains
> > > > > > >> >
> > > > > > >> > For the first point, the Transformer interface is tightly coupled with Spark in its design, and it contains a Spark-specific context. This makes it impossible for us to take advantage of the transform capabilities provided by other engines (such as Flink) after supporting multiple engines. Therefore, we need to decouple it from Spark in the design.
> > > > > > >> >
> > > > > > >> > For the second point, we can enhance the Transformer and provide some out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer, and so on.
> > > > > > >> >
> > > > > > >> > For the third point, the most common pattern for data processing is the pipeline model, and the common implementation of the pipeline model is the responsibility-chain model, comparable to Apache Commons Chain[1]. Combining multiple Transformers can make data processing more flexible and extensible.
> > > > > > >> >
> > > > > > >> > If we enhance the capabilities of the Transformer component, Hudi will provide richer data processing capabilities based on the computing engine.
> > > > > > >> >
> > > > > > >> > What do you think?
> > > > > > >> >
> > > > > > >> > Any opinions and feedback are welcome and appreciated.
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Vino
> > > > > > >> >
> > > > > > >> > [1]: https://commons.apache.org/proper/commons-chain/
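As a rough illustration of the Transformer refactoring discussed in the quoted thread (decoupling from Spark, built-in transformers, and a transformer chain), the sketch below uses entirely hypothetical type and method names, not the current Hudi API; `Dataset` stands in for whatever engine-neutral abstraction a Spark or Flink backend would implement.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Hypothetical engine-agnostic dataset handle. A Spark implementation
 * might wrap Dataset<Row>, a Flink implementation a DataStream.
 */
interface EngineDataset<T> {
  EngineDataset<T> filter(Predicate<T> predicate);
  <R> EngineDataset<R> map(Function<T, R> fn);
}

/** Engine-agnostic transformer: takes a dataset, returns a transformed one. */
interface Transformer<T> {
  EngineDataset<T> apply(EngineDataset<T> input);
}

/** Example built-in transformer: keeps only records matching a predicate. */
class FilterTransformer<T> implements Transformer<T> {
  private final Predicate<T> predicate;

  FilterTransformer(Predicate<T> predicate) {
    this.predicate = predicate;
  }

  @Override
  public EngineDataset<T> apply(EngineDataset<T> input) {
    return input.filter(predicate);
  }
}

/** Runs transformers in order, pipeline / chain-of-responsibility style. */
class TransformerChain<T> implements Transformer<T> {
  private final List<Transformer<T>> stages;

  TransformerChain(List<Transformer<T>> stages) {
    this.stages = stages;
  }

  @Override
  public EngineDataset<T> apply(EngineDataset<T> input) {
    EngineDataset<T> current = input;
    for (Transformer<T> stage : stages) {
      current = stage.apply(current);
    }
    return current;
  }
}
```

In this sketch, only the `EngineDataset` implementations know about Spark or Flink, so the same chain of transformers could run on either engine, which mirrors the responsibility-chain idea referenced via Apache Commons Chain above.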
