Greetings everyone,
My name is Xinyao and I currently work for an insurance company. We have found Apache Hudi to be an extremely useful tool, and even more powerful when it cooperates with Apache Flink. We have been using it for months and keep benefiting from it. However, there is one feature we really need that Hudi does not currently have: what we call "multiple event_time fields verification".

In the insurance industry, data is often distributed across dozens of tables that are conceptually connected by the same primary keys. To use the data, we often need to associate several or even dozens of tables through Join operations and stitch the partial columns together into a single record with dozens or even hundreds of columns for downstream services. Here is the problem: to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps within one table and keep the most recent version of each part. In this scenario, the single event_time filtering field that Hudi provides (i.e. the 'write.precombine.field' option in Hudi 0.10.0) is not enough. To cope with use cases involving complex Joins like the one above, and to open Hudi up to more application scenarios and industries, Hudi needs to support filtering on multiple event_time timestamps in a single table.

The good news is that, after more than two months of development, my colleagues and I have made changes to the hudi-flink and hudi-common modules based on Hudi 0.10.0 and have essentially implemented this feature. My team is currently running the enhanced source code with Kafka and Flink 1.13.2, performing end-to-end tests on a dataset of more than 140 million real-world insurance records and verifying the accuracy of the data. The results are quite good: based on our continuous observations over the past weeks, every part of these extremely wide records has been updated to its latest status.

We are very keen to make this feature available to everyone. We have benefited from the Hudi community, so we would really like to give back to the community with our efforts. The only open question is whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the official Hudi documentation, we first have to confirm that the feature does not already exist before creating a new RFC to share the concept and code and explain them in detail. We would therefore like to create a new RFC that explains our implementation in detail, with both the theory and the code, and makes it easy for everyone to understand and improve upon.

Looking forward to your feedback on whether we should create a new RFC and make Hudi better and better for everyone.

Kind regards,
Xinyao Tian
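P.S. To make the idea more concrete, below is a greatly simplified sketch of the kind of merge logic we have in mind, written as a custom HoodieRecordPayload on top of Hudi 0.10.0. Please note that the class name, the hard-coded column groups, and their event_time fields are purely illustrative and are not our actual implementation (which we would describe properly in the RFC):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.avro.HoodieAvroUtils;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only: merges two versions of a record column group by
 * column group, keeping each group from whichever version carries the newer
 * event_time, instead of letting one precombine field decide the whole row.
 */
public class MultiEventTimeAvroPayload extends OverwriteWithLatestAvroPayload {

  // Hypothetical mapping: each event_time field governs a group of columns.
  // A real implementation would read this from a write option instead.
  private static final Map<String, List<String>> EVENT_TIME_TO_COLUMNS = new HashMap<>();
  static {
    EVENT_TIME_TO_COLUMNS.put("policy_ts", Arrays.asList("policy_no", "policy_status"));
    EVENT_TIME_TO_COLUMNS.put("claim_ts", Arrays.asList("claim_amount", "claim_status"));
  }

  public MultiEventTimeAvroPayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    if (recordBytes.length == 0) {
      return Option.empty(); // nothing incoming, e.g. a delete marker
    }
    GenericRecord incoming = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
    GenericRecord stored = (GenericRecord) currentValue;

    // For each column group, keep the columns (and the event_time itself)
    // from whichever side carries the larger event_time.
    for (Map.Entry<String, List<String>> group : EVENT_TIME_TO_COLUMNS.entrySet()) {
      String tsField = group.getKey();
      Comparable incomingTs = (Comparable) incoming.get(tsField);
      Comparable storedTs = (Comparable) stored.get(tsField);
      if (storedTs != null && (incomingTs == null || storedTs.compareTo(incomingTs) > 0)) {
        // The stored version of this group is newer: copy it onto the incoming record.
        incoming.put(tsField, storedTs);
        for (String col : group.getValue()) {
          incoming.put(col, stored.get(col));
        }
      }
    }
    return Option.of(incoming);
  }

  // Note: preCombine(), which deduplicates records within one write batch,
  // needs the same group-wise treatment; it is omitted to keep the sketch short.
}

The effect is that when two partial updates of the same wide record arrive out of order, each column group keeps its own freshest values, instead of the whole row being overwritten by whichever write happens to arrive last.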