Thanks, cool. Then the remaining questions are:

- Where do we record these changes? Should we add a built-in meta field
  such as _change_flag_, like the other system columns, e.g.
  _hoodie_commit_time? (See the sketch below.)
- What kind of table should keep these flags? In my opinion, we should
  only add these flags for MERGE_ON_READ tables, and only for the AVRO
  logs.
- We should add a config to switch the flags in the system meta fields
  on/off.
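To make the first and third points concrete, here is a rough sketch of
what the meta field and the switch could look like. Everything below is a
placeholder for discussion, not an existing API, except
_hoodie_commit_time, which is a real system column:

    // Hypothetical sketch only: the change-flag field name, the enum and
    // the config key are made-up placeholders for this discussion.
    public final class ChangeFlagSketch {

        // Existing system meta column, shown for comparison.
        public static final String COMMIT_TIME_METADATA_FIELD = "_hoodie_commit_time";

        // Proposed system meta column carrying the change flag.
        public static final String CHANGE_FLAG_METADATA_FIELD = "_change_flag_";

        // The four flag values of the CDC pattern.
        public enum ChangeFlag { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

        // Proposed on/off switch, disabled by default; it would only be
        // honored for MERGE_ON_READ tables writing AVRO log blocks.
        public static final String CHANGE_FLAG_ENABLED = "hoodie.write.change.flag.enabled";
    }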
What do you think?

Best,
Danny Chan

vino yang <yanghua1...@gmail.com> wrote on Thu, Apr 1, 2021 at 10:58 AM:

>> Oops, the image didn't come through; by "change flags", I mean: insert,
>> update (before and after) and delete.
>
> Yes, the image I attached is also about these flags.
> [image: image (3).png]
>
> +1 for the idea.
>
> Best,
> Vino
>
> Danny Chan <danny0...@apache.org> wrote on Thu, Apr 1, 2021 at 10:03 AM:
>
>> Oops, the image didn't come through; by "change flags", I mean: insert,
>> update (before and after) and delete.
>>
>> The Flink engine can propagate the change flags internally between its
>> operators; if HUDI can send the change flags to Flink, the incremental
>> computation of CDC becomes very natural (almost transparent to users).
>>
>> Best,
>> Danny Chan
>>
>> vino yang <yanghua1...@gmail.com> wrote on Wed, Mar 31, 2021 at 11:32 PM:
>>
>>> Hi Danny,
>>>
>>> Thanks for kicking off this discussion thread.
>>>
>>> Yes, incremental query (or say, "incremental processing") has always
>>> been an important feature of the Hudi framework. If we can make this
>>> feature better, it will be even more exciting.
>>>
>>> In the data warehouse, for some complex calculations, I have not found
>>> a good way to conveniently use incremental change data (similar to the
>>> concept of a retraction stream in Flink?) to locally "correct" the
>>> aggregation results (these aggregation results may belong to the DWS
>>> layer).
>>>
>>> BTW: Yes, I do admit that some simple calculation scenarios (a single
>>> table, or an algorithm that can be retracted very easily) can be
>>> handled with incremental computation over CDC.
>>>
>>> Of course, what incremental computation means in various situations is
>>> sometimes not very clear. Maybe we will discuss it more clearly in
>>> specific scenarios.
>>>
>>>> If HUDI can keep and propagate these change flags to its consumers,
>>>> we can use HUDI as the unified format for the pipeline.
>>>
>>> Regarding the "change flags" here, do you mean flags like the ones
>>> shown in the figure below?
>>>
>>> [image: image.png]
>>>
>>> Best,
>>> Vino
>>>
>>> Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021 at 6:24 PM:
>>>
>>>> Hi dear HUDI community ~ Here I want to start a discussion about using
>>>> HUDI as the unified storage/format for data warehouse/lake incremental
>>>> computation.
>>>>
>>>> Usually people divide data warehouse production into several layers,
>>>> such as ODS (operational data store), DWD (data warehouse details),
>>>> DWS (data warehouse service), and ADS (application data service):
>>>>
>>>> ODS -> DWD -> DWS -> ADS
>>>>
>>>> In NEAR-REAL-TIME (or pure real-time) computation cases, a big topic
>>>> is syncing the change log (the CDC pattern) from all kinds of RDBMS
>>>> into the warehouse/lake. The CDC pattern records and propagates the
>>>> change flag (insert, update before/after, delete) to the consumer;
>>>> with these flags, the downstream engines can perform real-time
>>>> incremental computation.
>>>>
>>>> Using a streaming engine like Flink, we can have a totally
>>>> NEAR-REAL-TIME computation pipeline for each of the layers.
>>>>
>>>> If HUDI can keep and propagate these change flags to its consumers, we
>>>> can use HUDI as the unified format for the pipeline.
>>>>
>>>> I'm expecting your nice ideas here ~
>>>>
>>>> Best,
>>>> Danny Chan
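For context on the Flink side of the thread above: Flink already models
exactly these four change flags on every row via
org.apache.flink.types.RowKind, which is what Danny refers to when he says
the engine can propagate the flags between its operators. A minimal,
self-contained sketch using the real Flink RowData API (the row contents
are dummy data):

    import org.apache.flink.table.data.GenericRowData;
    import org.apache.flink.table.data.StringData;
    import org.apache.flink.types.RowKind;

    // Sketch: the four change flags Flink attaches to each row. If Hudi
    // persisted an equivalent flag, a reader could restore it on the rows
    // it emits, roughly like this.
    public class RowKindExample {
        public static void main(String[] args) {
            GenericRowData row =
                GenericRowData.of(StringData.fromString("key1"), 100L);

            // An update travels as a retraction of the old value
            // (UPDATE_BEFORE) followed by the new value (UPDATE_AFTER).
            row.setRowKind(RowKind.UPDATE_BEFORE);
            System.out.println(row.getRowKind().shortString()); // prints "-U"

            row.setRowKind(RowKind.UPDATE_AFTER);
            System.out.println(row.getRowKind().shortString()); // prints "+U"
        }
    }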