Hello,

We have a use case where we have persisted the full CDC changelog for some
tables in S3. We want to bootstrap Hudi tables from that changelog and then
time-travel those tables to get snapshot views as of dates prior to the
bootstrap. Each insert/update/delete in the changelog carries a timestamp,
so the data needed to do this is present. A live consumer that processed
those events in real time and wrote them to a Hudi table would make this
possible. Because we are instead creating the Hudi table from a single
batch job, we cannot achieve it despite processing exactly the same data,
since time travel is based entirely on the Hudi commit time.
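
To make the difference concrete, here is a minimal plain-Python sketch (no
Hudi involved). The changelog contents, the write() and snapshot_as_of()
helpers, and the dates are all made up for illustration. It only models
"a snapshot is every change committed at or before the requested time" and
is not how Hudi is implemented.

```python
from datetime import datetime

# Hypothetical CDC changelog: (event_time, op, key, value)
CHANGELOG = [
    (datetime(2023, 7, 1, 9, 0), "insert", "a", 1),
    (datetime(2023, 7, 2, 9, 0), "update", "a", 2),
    (datetime(2023, 7, 3, 9, 0), "delete", "a", None),
]


def write(changelog, commit_time_for):
    """Tag each change with the commit time chosen by commit_time_for(event)."""
    return [(commit_time_for(e), e[1], e[2], e[3]) for e in changelog]


def snapshot_as_of(commits, as_of):
    """Replay every change committed at or before as_of, in commit order."""
    state = {}
    # sorted() is stable, so changes sharing a commit time keep changelog order.
    for commit_time, op, key, value in sorted(commits, key=lambda c: c[0]):
        if commit_time > as_of:
            break
        if op == "delete":
            state.pop(key, None)
        else:
            state[key] = value
    return state


# Assumed wall-clock time at which the single batch bootstrap job runs.
BOOTSTRAP_TIME = datetime(2023, 7, 28, 12, 0)

# Live consumer: commit time tracks event time (ignoring small lag).
live = write(CHANGELOG, lambda e: e[0])
# Batch bootstrap: every change lands under the one bootstrap commit time.
batch = write(CHANGELOG, lambda e: BOOTSTRAP_TIME)

query_time = datetime(2023, 7, 2, 12, 0)
live_view = snapshot_as_of(live, query_time)    # sees the insert and update
batch_view = snapshot_as_of(batch, query_time)  # sees nothing before bootstrap
```

Both writers process the same three changes, but only the live one can
answer a query for July 2nd.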

Aside from our specific use case of bootstrapping tables, this would be
useful for real-time CDC consumers as well. Currently, there is no way to
guarantee that a time-travel query accurately reflects the state of the
upstream database table at a given point in time. For example, say you
have downstream batch pipelines that perform aggregations over production
database tables as of a fixed point each day. If the consumer lags or has
an outage, then when it restarts there is a large gap in Hudi commit times,
and we are unable to time-travel to the exact moment at which the
downstream pipelines expect to see the database table state.

If the Hudi writer instead supported using a field from the CDC record as
the value for the Hudi commit time, then the consumer could process the
events at any time and time travel would behave the same regardless of
when they were consumed. This would make the writer idempotent in a way
that it currently is not, guaranteeing consistent results for downstream
pipelines.
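
As a rough illustration of the proposal, the plain-Python sketch below
compares what a time-travel query at a daily cutoff would see when commit
times come from the consumer's clock versus from a field on the record.
Everything here is hypothetical: the "ts" field, the helper names, the
dates, and the yyyyMMddHHmmssSSS instant format are assumptions for the
example, and it ignores that a real commit groups many records.

```python
from datetime import datetime, timedelta

# Hypothetical CDC events; "ts" is the change timestamp from the source DB.
EVENTS = [
    {"id": 1, "ts": datetime(2023, 7, 27, 23, 50)},
    {"id": 2, "ts": datetime(2023, 7, 27, 23, 58)},
    {"id": 3, "ts": datetime(2023, 7, 28, 0, 5)},
]
# The downstream pipeline wants the table state as of midnight.
CUTOFF = datetime(2023, 7, 28, 0, 0)
# Assumed time at which the consumer comes back after an outage.
RESTART = datetime(2023, 7, 28, 2, 0)


def to_instant(ts):
    """Format a datetime as a fixed-width yyyyMMddHHmmssSSS string."""
    return ts.strftime("%Y%m%d%H%M%S") + f"{ts.microsecond // 1000:03d}"


def visible_ids(events, commit_time_of, cutoff):
    """Ids of events whose commit instant is at or before the cutoff."""
    limit = to_instant(cutoff)
    # Fixed-width instants compare correctly as plain strings.
    return [e["id"] for e in events if to_instant(commit_time_of(e)) <= limit]


def healthy_clock(event):
    # Consumer keeps up: each event is committed about a second after it occurs.
    return event["ts"] + timedelta(seconds=1)


def lagging_clock(event):
    # Consumer was down: nothing is committed before RESTART.
    return max(event["ts"], RESTART)


def event_time(event):
    # Proposed behaviour: commit time is taken from the record itself.
    return event["ts"]


wall_clock_healthy = visible_ids(EVENTS, healthy_clock, CUTOFF)
wall_clock_lagging = visible_ids(EVENTS, lagging_clock, CUTOFF)
event_time_view = visible_ids(EVENTS, event_time, CUTOFF)
```

With wall-clock commit times the answer at the cutoff depends on whether
the consumer was healthy; with event-time commit times it depends only on
the data.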

Original Slack Thread:
https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259
