Thanks, cool. Then the remaining questions are:

- Where do we record these changes? Should we add a built-in meta field
  such as _change_flag_, like the other system columns, e.g.
  _hoodie_commit_time? (See the sketch below.)
- What kind of table should keep these flags? In my opinion, we should
  only add these flags for MERGE_ON_READ tables, and only for the AVRO
  logs.
- We should add a config to switch the flags in the system meta fields
  on/off.
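To make the first and third points concrete, here is a rough sketch of
what the meta field and the switch could look like. Everything below is a
placeholder for discussion, not an existing API, except
_hoodie_commit_time, which is a real system column:

    // Hypothetical sketch only: the change-flag field name, the enum and
    // the config key are made-up placeholders for this discussion.
    public final class ChangeFlagSketch {

        // Existing system meta column, shown for comparison.
        public static final String COMMIT_TIME_METADATA_FIELD = "_hoodie_commit_time";

        // Proposed system meta column carrying the change flag.
        public static final String CHANGE_FLAG_METADATA_FIELD = "_change_flag_";

        // The four flag values of the CDC pattern.
        public enum ChangeFlag { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

        // Proposed on/off switch, disabled by default; it would only be
        // honored for MERGE_ON_READ tables writing AVRO log blocks.
        public static final String CHANGE_FLAG_ENABLED = "hoodie.write.change.flag.enabled";
    }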
What do you think?

Best,
Danny Chan

vino yang <yanghua1...@gmail.com> wrote on Thu, Apr 1, 2021 at 10:58 AM:

>> Oops, the image didn't come through; by "change flags", I mean: insert,
>> update (before and after) and delete.
>
> Yes, the image I attached is also about these flags.
> [image: image (3).png]
>
> +1 for the idea.
>
> Best,
> Vino
>
> Danny Chan <danny0...@apache.org> wrote on Thu, Apr 1, 2021 at 10:03 AM:
>
>> Oops, the image didn't come through; by "change flags", I mean: insert,
>> update (before and after) and delete.
>>
>> The Flink engine can propagate the change flags internally between its
>> operators; if HUDI can send the change flags to Flink, the incremental
>> computation of CDC becomes very natural (almost transparent to users).
>>
>> Best,
>> Danny Chan
>>
>> vino yang <yanghua1...@gmail.com> wrote on Wed, Mar 31, 2021 at 11:32 PM:
>>
>>> Hi Danny,
>>>
>>> Thanks for kicking off this discussion thread.
>>>
>>> Yes, incremental query (or say, "incremental processing") has always
>>> been an important feature of the Hudi framework. If we can make this
>>> feature better, it will be even more exciting.
>>>
>>> In the data warehouse, for some complex calculations, I have not found
>>> a good way to conveniently use incremental change data (similar to the
>>> concept of a retraction stream in Flink?) to locally "correct" the
>>> aggregation results (these aggregation results may belong to the DWS
>>> layer).
>>>
>>> BTW: Yes, I do admit that some simple calculation scenarios (a single
>>> table, or an algorithm that can be retracted very easily) can be
>>> handled with incremental computation over CDC.
>>>
>>> Of course, what incremental computation means in various situations is
>>> sometimes not very clear. Maybe we will discuss it more clearly in
>>> specific scenarios.
>>>
>>>> If HUDI can keep and propagate these change flags to its consumers,
>>>> we can use HUDI as the unified format for the pipeline.
>>>
>>> Regarding the "change flags" here, do you mean flags like the ones
>>> shown in the figure below?
>>>
>>> [image: image.png]
>>>
>>> Best,
>>> Vino
>>>
>>> Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021 at 6:24 PM:
>>>
>>>> Hi dear HUDI community ~ Here I want to start a discussion about using
>>>> HUDI as the unified storage/format for data warehouse/lake incremental
>>>> computation.
>>>>
>>>> Usually people divide data warehouse production into several layers,
>>>> such as ODS (operational data store), DWD (data warehouse details),
>>>> DWS (data warehouse service), and ADS (application data service):
>>>>
>>>> ODS -> DWD -> DWS -> ADS
>>>>
>>>> In NEAR-REAL-TIME (or pure real-time) computation cases, a big topic
>>>> is syncing the change log (the CDC pattern) from all kinds of RDBMS
>>>> into the warehouse/lake. The CDC pattern records and propagates the
>>>> change flag (insert, update before/after, delete) to the consumer;
>>>> with these flags, the downstream engines can perform real-time
>>>> incremental computation.
>>>>
>>>> Using a streaming engine like Flink, we can have a totally
>>>> NEAR-REAL-TIME computation pipeline for each of the layers.
>>>>
>>>> If HUDI can keep and propagate these change flags to its consumers, we
>>>> can use HUDI as the unified format for the pipeline.
>>>>
>>>> I'm expecting your nice ideas here ~
>>>>
>>>> Best,
>>>> Danny Chan
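For context on the Flink side of the thread above: Flink already models
exactly these four change flags on every row via
org.apache.flink.types.RowKind, which is what Danny refers to when he says
the engine can propagate the flags between its operators. A minimal,
self-contained sketch using the real Flink RowData API (the row contents
are dummy data):

    import org.apache.flink.table.data.GenericRowData;
    import org.apache.flink.table.data.StringData;
    import org.apache.flink.types.RowKind;

    // Sketch: the four change flags Flink attaches to each row. If Hudi
    // persisted an equivalent flag, a reader could restore it on the rows
    // it emits, roughly like this.
    public class RowKindExample {
        public static void main(String[] args) {
            GenericRowData row =
                GenericRowData.of(StringData.fromString("key1"), 100L);

            // An update travels as a retraction of the old value
            // (UPDATE_BEFORE) followed by the new value (UPDATE_AFTER).
            row.setRowKind(RowKind.UPDATE_BEFORE);
            System.out.println(row.getRowKind().shortString()); // prints "-U"

            row.setRowKind(RowKind.UPDATE_AFTER);
            System.out.println(row.getRowKind().shortString()); // prints "+U"
        }
    }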