Hello,
We have a use case where we have persisted the full CDC changelog for some tables in S3. We want to bootstrap Hudi tables from that changelog and then time-travel the Hudi table to get snapshot views of the table as of dates prior to the bootstrap. Each changelog record carries the timestamp of its insert, update, or delete, so the data needed to achieve this is present. If a live consumer had processed those events in real time and written them to a Hudi table, this would work. Because we instead create the Hudi table from a single batch job, it does not, even though we process exactly the same data: time travel is based entirely on the Hudi commit time.
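To make the gap concrete, here is a minimal pure-Python sketch of the behavior described above. It does not call any Hudi API. It only models "time travel" as replaying every change whose version timestamp is at or before the requested instant. The `ChangeEvent` class, its field names, `snapshot_as_of`, and the timestamps are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC record. The shape and field names are assumptions for this sketch."""
    event_ts: int                 # when the change happened in the source database
    op: str                       # "insert", "update", or "delete"
    key: str
    value: Optional[str] = None   # None for deletes


def snapshot_as_of(versioned: List[Tuple[int, ChangeEvent]], as_of_ts: int) -> Dict[str, str]:
    """Rebuild table state from changes whose version timestamp is <= as_of_ts.

    Each item is (version_ts, event), where version_ts stands in for the
    commit time the writer assigned to that change.
    """
    state: Dict[str, str] = {}
    # Apply changes in version order, breaking ties by source event time.
    for version_ts, ev in sorted(versioned, key=lambda p: (p[0], p[1].event_ts)):
        if version_ts > as_of_ts:
            continue
        if ev.op == "delete":
            state.pop(ev.key, None)
        else:
            state[ev.key] = ev.value
    return state


changelog = [
    ChangeEvent(100, "insert", "a", "v1"),
    ChangeEvent(200, "update", "a", "v2"),
    ChangeEvent(250, "insert", "b", "w1"),
    ChangeEvent(300, "delete", "a"),
]

# A single batch job stamps every change with the same commit time.
BATCH_COMMIT_TS = 1000
by_commit_time = [(BATCH_COMMIT_TS, ev) for ev in changelog]

# The requested behavior: use the timestamp carried by the CDC record instead.
by_event_time = [(ev.event_ts, ev) for ev in changelog]

# Asking for the table as of ts=260, before the batch job ran:
batch_view = snapshot_as_of(by_commit_time, 260)   # nothing is visible yet
event_view = snapshot_as_of(by_event_time, 260)    # state of the source at ts=260

# Both approaches agree on the latest state.
latest_batch = snapshot_as_of(by_commit_time, BATCH_COMMIT_TS)
latest_event = snapshot_as_of(by_event_time, BATCH_COMMIT_TS)
```

In this model the batch-loaded table has no history before `BATCH_COMMIT_TS`, while the event-time version can answer the same as-of query from identical input data.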
Aside from our specific use case of bootstrapping tables, this would be useful for real-time CDC consumers as well. Currently, there is no way to guarantee that a time-travel query accurately reflects the state of the upstream database table at a given point in time. For example, say some downstream batch pipelines perform aggregations over production database tables as of a fixed point each day. If the consumer lags or suffers an outage, then when it restarts there is a large gap in Hudi commit times, and we cannot time-travel to the exact moment whose database state the downstream pipelines expect. If the Hudi writer instead supported using a field from the CDC record as the value for the Hudi commit time, the consumer could process the events at any time and time travel would return the same results regardless of when the events were consumed. This would make the writer idempotent in a way that it currently is not, guaranteeing consistent results for downstream pipelines.
Original Slack Thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259