kbuci opened a new issue, #17848: URL: https://github.com/apache/hudi/issues/17848
### Task Description

**What needs to be done:**

An ingestion write to a HUDI dataset may add "checkpoint" information to the commit/replacecommit metadata, either in HUDI-defined internal fields such as the "deltastreamer checkpoint", or as user-defined metadata in the `extraMetadata` field. Regardless of the clean/archival config, and regardless of how many instants or how much time has passed since the latest insert/upsert/bulk_insert/insert_overwrite instant, a user should be able to:

- Retrieve the `extraMetadata` field of the latest insert/upsert/bulk_insert/insert_overwrite instant
- Rely on HUDI streamer to automatically get the previous "deltastreamer checkpoint" value

**Why this task is needed:**

The checkpoint info may be "lost" in the following scenario:

1. A backfill of other writes, such as clustering, creates many instants on the timeline
2. Archival runs and archives the latest ingestion instant
3. The "checkpoint" info is no longer in the active timeline

We have encountered this scenario in our incremental ingestion workloads, where upon "losing" the checkpoint we need to manually intervene to add it again. Currently, in our organization's internal 0.x HUDI build, we have prevented this issue by:

- Ensuring archival doesn't archive the latest write with checkpoint info
- Adding an "empty commit" API (which transfers over the checkpoint info), https://github.com/apache/hudi/pull/11606, which will automatically create a new empty write every x hours

### Task Type

Code improvement/refactoring
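To make the scenario under "Why this task is needed" concrete, here is a minimal Python sketch of the first mitigation (archival that never archives the latest write carrying checkpoint info). It is a toy model, not HUDI code: the `Instant` class, the `CHECKPOINT_KEY` name, the `"cluster"` operation label, and the `keep_latest` parameter are all hypothetical and chosen only for illustration. Only the ingestion operation names come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Operation names as listed in the task description.
INGESTION_OPS = {"insert", "upsert", "bulk_insert", "insert_overwrite"}
# Hypothetical metadata key; the real key depends on the writer.
CHECKPOINT_KEY = "checkpoint"


@dataclass
class Instant:
    """Toy stand-in for a timeline instant (not a HUDI class)."""
    timestamp: str
    operation: str
    extra_metadata: Dict[str, str] = field(default_factory=dict)


def latest_checkpoint_instant(timeline: List[Instant]) -> Optional[Instant]:
    """Return the newest ingestion instant that carries a checkpoint, or None."""
    for instant in sorted(timeline, key=lambda i: i.timestamp, reverse=True):
        if (instant.operation in INGESTION_OPS
                and CHECKPOINT_KEY in instant.extra_metadata):
            return instant
    return None


def archive(timeline: List[Instant],
            keep_latest: int) -> Tuple[List[Instant], List[Instant]]:
    """Split the timeline into (active, archived).

    The newest `keep_latest` instants stay active, and so does the latest
    checkpoint-carrying ingestion instant, however old it is.
    """
    ordered = sorted(timeline, key=lambda i: i.timestamp)
    pinned = latest_checkpoint_instant(ordered)
    cutoff = max(len(ordered) - keep_latest, 0)
    active, archived = [], []
    for idx, instant in enumerate(ordered):
        if idx >= cutoff or instant is pinned:
            active.append(instant)
        else:
            archived.append(instant)
    return active, archived


# One ingestion write, followed by a backfill of clustering instants.
timeline = [Instant("001", "upsert", {CHECKPOINT_KEY: "offset-42"})]
timeline += [Instant(f"{n:03d}", "cluster") for n in range(2, 8)]

active, archived = archive(timeline, keep_latest=3)
recovered = latest_checkpoint_instant(active)
```

In this toy run the ingestion instant `001` is older than the three newest instants, so a plain "keep the newest N" policy would archive it. Pinning it keeps the checkpoint readable from the active timeline, which is the behavior the task asks for.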
