kbuci commented on issue #17848: URL: https://github.com/apache/hudi/issues/17848#issuecomment-3780686278
Sure, let me add some context. The two "checkpoints" we use internally are:

- For some jobs we use Deltastreamer, which means they implicitly use HoodieStreamer's internal checkpoint that you mentioned. I believe it is stored under the `deltastreamer.checkpoint.key` key.
- For other jobs we use the Hudi Spark batch writer to read records from a non-Hudi source, specifically a Kafka topic, with a custom checkpoint implementation: the application passes the Kafka offset info in the `extraMetadata` field when committing the Hudi write. The next run of the application then uses Hudi APIs to read the `extraMetadata` field of the latest "ingestion" (non-clustering, non-compaction, etc.) write instant, and uses that Kafka offset when reading from the Kafka topic to retrieve the collection of "new" records to `insert` into the Hudi dataset. (A rough sketch of this flow follows below.)
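For concreteness, here is a minimal sketch of that second flow, assuming the Spark datasource writer and the `hoodie.datasource.write.commitmeta.key.prefix` mechanism (write options carrying the prefix, `_` by default, are copied into the commit's `extraMetadata`). The key `_kafka.offsets`, the broker/topic/table names, and the exact timeline APIs are illustrative and may vary across Hudi versions:

```scala
import org.apache.hudi.common.model.{HoodieCommitMetadata, WriteOperationType}
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.spark.sql.{SaveMode, SparkSession}

object KafkaOffsetCheckpoint {
  // Hypothetical extraMetadata key. The leading "_" matches the default
  // hoodie.datasource.write.commitmeta.key.prefix, so the datasource writer
  // copies this option into the commit's extraMetadata map.
  val OffsetsKey = "_kafka.offsets"

  // Batch-read from Kafka and write into Hudi, stamping the commit with the
  // offsets to resume from on the next run. (Record key / precombine configs
  // and Kafka value deserialization are elided; computing the "next" offsets
  // to store is application-specific.)
  def writeBatch(spark: SparkSession, basePath: String,
                 startOffsetsJson: String, nextOffsetsJson: String): Unit = {
    val df = spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // illustrative
      .option("subscribe", "my-topic")                  // illustrative
      .option("startingOffsets", startOffsetsJson)      // resume point
      .load()

    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.operation", "insert")
      .option(OffsetsKey, nextOffsetsJson) // lands in extraMetadata
      .mode(SaveMode.Append)
      .save(basePath)
  }

  // Walk completed commits newest-first and return the offsets stored by the
  // latest ingestion commit, skipping compaction/clustering instants by
  // checking the commit's operation type.
  def readLatestOffsets(spark: SparkSession, basePath: String): Option[String] = {
    // Builder signature varies across Hudi versions (hadoop Configuration
    // vs. StorageConfiguration in newer releases).
    val metaClient = HoodieTableMetaClient.builder()
      .setConf(spark.sparkContext.hadoopConfiguration)
      .setBasePath(basePath)
      .build()

    val ingestionOps = java.util.EnumSet.of(
      WriteOperationType.INSERT, WriteOperationType.UPSERT, WriteOperationType.BULK_INSERT)

    val timeline = metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants()
    val latest = timeline.getReverseOrderedInstants
      .map[HoodieCommitMetadata] { instant =>
        HoodieCommitMetadata.fromBytes(
          timeline.getInstantDetails(instant).get(), classOf[HoodieCommitMetadata])
      }
      .filter(meta => ingestionOps.contains(meta.getOperationType))
      .findFirst()

    if (latest.isPresent) Option(latest.get().getExtraMetadata.get(OffsetsKey)) else None
  }
}
```

Note that the "skip non-ingestion instants" step above keys off the commit's `WriteOperationType`; depending on your table type and Hudi version you may need to filter replacecommits (clustering) and compaction commits differently.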
