kbuci commented on issue #17848:
URL: https://github.com/apache/hudi/issues/17848#issuecomment-3780686278

   Sure, let me add some context. The two "checkpoints" we use internally are:
   
   - For some jobs we use Deltastreamer, which means they implicitly use 
HoodieStreamer's internal checkpoint that you mentioned. I believe it is 
stored in the commit metadata under the key `deltastreamer.checkpoint.key`.
   - For other jobs we use the Hudi Spark batch writer to read records from a 
non-Hudi source, specifically a Kafka topic, with a custom checkpoint 
implementation: the application passes the Kafka offset info in the 
`extraMetadata` field when committing the Hudi write. The next run of the 
application then uses Hudi APIs to read the `extraMetadata` field of the latest 
"ingestion" (non-clustering, non-compaction, etc.) write instant, and uses that 
Kafka offset when reading from the Kafka topic to retrieve the collection of 
"new" records to `insert` into the Hudi dataset (see the sketch after this 
list).

