Greetings everyone,

My name is Xinyao and I'm currently working for an insurance company. We have
found Apache Hudi to be an extremely useful tool, and it becomes even more
powerful when combined with Apache Flink. We have been using it for months and
continue to benefit from it.


However, there is one feature that we really need but Hudi doesn't currently
have: what we call "multiple event_time fields verification". In the insurance
industry, data is often distributed across dozens of tables that are
conceptually connected by the same primary keys. When the data is consumed, we
often need to join several or even dozens of these tables and stitch the
partial columns into a single record with dozens or even hundreds of columns
for downstream services to use.


Here is the problem: to guarantee that every part of a joined record is up to
date, Hudi must be able to track multiple event_time timestamps within a single
table and keep the most recent value for each. In this scenario, the single
event_time filtering field provided by Hudi (i.e. the 'write.precombine.field'
option in Hudi 0.10.0) is inadequate. To cope with complex join use cases like
the one above, and to open Hudi up to more application scenarios and
industries, Hudi needs to support filtering on multiple event_time timestamps
within a single table.
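To make the intended semantics concrete, here is a minimal, self-contained sketch (not our actual implementation, and not Hudi's API) of the merge rule we have in mind: columns are partitioned into groups, each group carries its own event_time field, and on an upsert each group is taken from whichever version of the record has the newer timestamp for that group. The field names (policy_ts, claim_ts, etc.) are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-column-group event_time merging.
// Each entry in `groups` maps an event_time field to the columns it governs.
public class MultiEventTimeMerge {
    static Map<String, Object> merge(Map<String, Object> oldRec,
                                     Map<String, Object> newRec,
                                     Map<String, String[]> groups) {
        Map<String, Object> out = new HashMap<>(oldRec);
        for (Map.Entry<String, String[]> g : groups.entrySet()) {
            String tsField = g.getKey();
            long oldTs = ((Number) oldRec.getOrDefault(tsField, 0L)).longValue();
            long newTs = ((Number) newRec.getOrDefault(tsField, 0L)).longValue();
            // Take this group's columns from the incoming record only if it is newer.
            if (newTs >= oldTs) {
                for (String col : g.getValue()) {
                    out.put(col, newRec.get(col));
                }
                out.put(tsField, newTs);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> groups = new HashMap<>();
        groups.put("policy_ts", new String[]{"policy_status"});
        groups.put("claim_ts", new String[]{"claim_amount"});

        Map<String, Object> oldRec = new HashMap<>();
        oldRec.put("policy_ts", 100L); oldRec.put("policy_status", "active");
        oldRec.put("claim_ts", 200L);  oldRec.put("claim_amount", 500);

        Map<String, Object> newRec = new HashMap<>();
        newRec.put("policy_ts", 150L); newRec.put("policy_status", "lapsed");
        newRec.put("claim_ts", 50L);   newRec.put("claim_amount", 900);

        Map<String, Object> merged = merge(oldRec, newRec, groups);
        // Policy group is updated (150 > 100); claim group is kept (50 < 200).
        System.out.println(merged.get("policy_status") + " " + merged.get("claim_amount"));
    }
}
```

With a single precombine field, the whole record would be taken from one side or the other; the point of the sketch is that each column group keeps its own latest state independently.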


The good news is that, after more than two months of development, my colleagues
and I have made changes to the hudi-flink and hudi-common modules based on
Hudi 0.10.0 and have essentially implemented this feature. My team is currently
running the enhanced code with Kafka and Flink 1.13.2, conducting end-to-end
tests on a dataset of more than 140 million real-world insurance records and
verifying the accuracy of the data. The results are quite good: based on our
continuous observations over recent weeks, every part of these extremely wide
records has been updated to its latest status. We are very keen to make this
feature available to everyone. We have benefited from the Hudi community, and
we would really like to give back with our efforts.


The only question is whether we should create an RFC to illustrate our design
and implementation in detail. According to the "RFC Process" in the official
Hudi documentation, we first need to confirm that this feature does not already
exist before creating a new RFC to share the concept and code and explain them
in detail. We would therefore like to create a new RFC that explains our
implementation in detail, with both theory and code, and makes it easier for
everyone to understand and improve upon.


We look forward to your feedback on whether we should create a new RFC and help
make Hudi better for everyone.


Kind regards,
Xinyao Tian
