>> Oops, the image got garbled. By "change flags", I mean: insert,
update (before and after), and delete.

Yes, the image I attached is also about these flags.
[image: image (3).png]
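For archive readers whose images also fail to load, the flags being discussed can be sketched as follows. This is a minimal model, not HUDI's actual on-disk representation; the +I/-U/+U/-D codes follow Flink's RowKind short strings, and the row fields are hypothetical:

```python
# A minimal model of CDC change flags, mirroring Flink's RowKind codes.
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"

# An update to a row is emitted as two records: a retraction of the old
# row image (-U) followed by the new row image (+U).
changelog = [
    (INSERT,        {"id": 1, "amount": 10}),   # row created
    (UPDATE_BEFORE, {"id": 1, "amount": 10}),   # old image retracted
    (UPDATE_AFTER,  {"id": 1, "amount": 25}),   # new image applied
    (DELETE,        {"id": 1, "amount": 25}),   # row removed
]

for flag, row in changelog:
    print(flag, row)
```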

+1 for the idea.

Best,
Vino


Danny Chan <danny0...@apache.org> 于2021年4月1日周四 上午10:03写道:

> Oops, the image got garbled. By "change flags", I mean: insert, update
> (before and after), and delete.
>
> The Flink engine can propagate the change flags internally between its
> operators. If HUDI can send the change flags to Flink, incremental CDC
> calculation would come very naturally (almost transparently to users).
>
> Best,
> Danny Chan
>
> vino yang <yanghua1...@gmail.com> 于2021年3月31日周三 下午11:32写道:
>
> > Hi Danny,
> >
> > Thanks for kicking off this discussion thread.
> >
> > Yes, incremental queries (or, say, "incremental processing") have always
> > been an important feature of the Hudi framework. If we can make this
> > feature better, it will be even more exciting.
> >
> > For some complex calculations in the data warehouse, I have not found a
> > good way to conveniently use incremental change data (similar to the
> > concept of a retract stream in Flink?) to locally "correct" aggregation
> > results (these aggregation results may belong to the DWS layer).
> >
> > BTW: Yes, I do admit that some simple calculation scenarios (a single
> > table, or an algorithm whose results can very easily be retracted) can
> > be handled with CDC-based incremental calculation.
> >
> > Of course, what "incremental calculation" means in various contexts is
> > sometimes not very clear. Maybe we can discuss it more concretely in
> > specific scenarios.
> >
> > >> If HUDI can keep and propagate these change flags to its consumers,
> > we can use HUDI as the unified format for the pipeline.
> >
> > Regarding the "change flags" here, do you mean the flags like the one
> > shown in the figure below?
> >
> > [image: image.png]
> >
> > Best,
> > Vino
> >
> > Danny Chan <danny0...@apache.org> 于2021年3月31日周三 下午6:24写道:
> >
> >> Hi dear HUDI community ~ Here I want to start a discussion about using
> >> HUDI as the unified storage/format for data warehouse/lake incremental
> >> computation.
> >>
> >> Usually people divide data warehouse production into several layers,
> >> such as the ODS (operational data store), DWD (data warehouse details),
> >> DWS (data warehouse service), and ADS (application data service).
> >>
> >>
> >> ODS -> DWD -> DWS -> ADS
> >>
> >> In NEAR-REAL-TIME (or purely real-time) computation cases, a big topic
> >> is syncing the change logs (the CDC pattern) from all kinds of RDBMS
> >> into the warehouse/lake. The CDC pattern records and propagates the
> >> change flags, namely insert, update (before and after), and delete, to
> >> the consumer. With these flags, the downstream engines can perform
> >> real-time incremental computation.
> >> Using a streaming engine like Flink, we can have a fully
> >> NEAR-REAL-TIME computation pipeline for each of the layers.
> >>
> >> If HUDI can keep and propagate these change flags to its consumers, we
> >> can use HUDI as the unified format for the pipeline.
> >>
> >> I'm expecting your nice ideas here ~
> >>
> >> Best,
> >> Danny Chan
> >>
> >
>

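The real-time accumulation described in the quoted thread can be sketched as follows. This is a minimal illustration under the assumption of Flink-style +I/-U/+U/-D flag codes and a hypothetical SUM(amount) aggregate; it shows how a downstream consumer keeps a running aggregate correct by adding +I/+U row images and retracting -U/-D ones:

```python
# Change flag codes, mirroring Flink's RowKind short strings.
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"

def apply_change(total, flag, row):
    """Incrementally maintain SUM(amount) from a changelog stream."""
    if flag in (INSERT, UPDATE_AFTER):
        return total + row["amount"]   # add the new row image
    return total - row["amount"]       # retract the old row image

# A row is inserted with amount=10, then updated to amount=25.
changelog = [
    (INSERT,        {"id": 1, "amount": 10}),
    (UPDATE_BEFORE, {"id": 1, "amount": 10}),
    (UPDATE_AFTER,  {"id": 1, "amount": 25}),
]

total = 0
for flag, row in changelog:
    total = apply_change(total, flag, row)

print(total)  # the aggregate reflects the update: 25
```

Without the UPDATE_BEFORE retraction, the consumer would have no way to subtract the stale value, which is why propagating all four flags (rather than only upserts) matters for this kind of pipeline.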