>> Oops, the image is broken. By "change flags", I mean: insert, update (before and after), and delete.
Yes, the image I attached is also about these flags.

[image: image (3).png]

+1 for the idea.

Best,
Vino

Danny Chan <danny0...@apache.org> wrote on Thu, Apr 1, 2021 at 10:03 AM:

> Oops, the image is broken. By "change flags", I mean: insert, update
> (before and after), and delete.
>
> The Flink engine can propagate the change flags internally between its
> operators. If HUDI can send the change flags to Flink, incremental
> computation over CDC would be very natural (almost transparent to users).
>
> Best,
> Danny Chan
>
> vino yang <yanghua1...@gmail.com> wrote on Wed, Mar 31, 2021 at 11:32 PM:
>
> > Hi Danny,
> >
> > Thanks for kicking off this discussion thread.
> >
> > Yes, incremental query (or "incremental processing") has always been
> > an important feature of the Hudi framework. If we can make this feature
> > better, it will be even more exciting.
> >
> > In the data warehouse, for some complex computations, I have not found
> > a good way to conveniently use incremental change data (similar to the
> > concept of a retract stream in Flink?) to locally "correct" an
> > aggregation result (these aggregation results may belong to the DWS
> > layer).
> >
> > BTW: Yes, I do admit that some simple computation scenarios (a single
> > table, or an algorithm whose results can be easily retracted) can be
> > handled with incremental computation over CDC.
> >
> > Of course, what "incremental computation" means varies with the
> > context and is sometimes not very clear. Maybe it will become clearer
> > when we discuss specific scenarios.
> >
> > >> If HUDI can keep and propagate these change flags to its consumers,
> > >> we can use HUDI as the unified format for the pipeline.
> >
> > Regarding the "change flags" here, do you mean flags like the ones
> > shown in the figure below?
> >
> > [image: image.png]
> >
> > Best,
> > Vino
> >
> > Danny Chan <danny0...@apache.org> wrote on Wed, Mar 31, 2021 at 6:24 PM:
> >
> >> Hi dear HUDI community ~ Here I want to start a discussion about
> >> using HUDI as the unified storage/format for data warehouse/lake
> >> incremental computation.
> >>
> >> Usually people divide data warehouse production into several layers,
> >> such as ODS (operational data store), DWD (data warehouse details),
> >> DWS (data warehouse service), and ADS (application data service):
> >>
> >> ODS -> DWD -> DWS -> ADS
> >>
> >> In NEAR-REAL-TIME (or pure real-time) computation cases, a big topic
> >> is syncing the change log (CDC pattern) from all kinds of RDBMSs into
> >> the warehouse/lake. The CDC pattern records and propagates the change
> >> flags: insert, update (before and after), and delete. With these
> >> flags, the downstream engines can perform real-time incremental
> >> computation.
> >>
> >> Using a streaming engine like Flink, we can have a fully
> >> NEAR-REAL-TIME computation pipeline for each layer.
> >>
> >> If HUDI can keep and propagate these change flags to its consumers,
> >> we can use HUDI as the unified format for the pipeline.
> >>
> >> I'm expecting your nice ideas here ~
> >>
> >> Best,
> >> Danny Chan
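For context on the change flags discussed in this thread: Flink models them as `RowKind` (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE) in `org.apache.flink.types`. Below is a minimal, self-contained Java sketch of how a downstream consumer could incrementally maintain a SUM aggregate from such a changelog. It defines its own `RowKind` stand-in rather than depending on Flink, and the `Change` record and `sum` method are illustrative names, not Hudi or Flink APIs:

```java
import java.util.List;

public class ChangelogSum {
    // Simplified stand-in for Flink's org.apache.flink.types.RowKind.
    enum RowKind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    // One change record: a flag plus the (single-column) value it carries.
    record Change(RowKind kind, long value) {}

    // Incrementally maintain SUM(value) over a changelog: add the value
    // for INSERT/UPDATE_AFTER, retract it for UPDATE_BEFORE/DELETE.
    static long sum(List<Change> changelog) {
        long acc = 0;
        for (Change c : changelog) {
            switch (c.kind()) {
                case INSERT, UPDATE_AFTER -> acc += c.value();
                case UPDATE_BEFORE, DELETE -> acc -= c.value();
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Change> log = List.of(
            new Change(RowKind.INSERT, 10),        // row inserted: sum = 10
            new Change(RowKind.UPDATE_BEFORE, 10), // old image retracted
            new Change(RowKind.UPDATE_AFTER, 25),  // new image applied: sum = 25
            new Change(RowKind.INSERT, 5),         // sum = 30
            new Change(RowKind.DELETE, 5));        // row removed: sum = 25
        System.out.println(sum(log)); // prints 25
    }
}
```

This is exactly why the update flag carries both a "before" and an "after" image: without the retraction of the old value, the aggregate cannot be corrected locally and would have to be recomputed from scratch.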