nsivabalan opened a new pull request, #8570: URL: https://github.com/apache/hudi/pull/8570
### Change Logs

With Spark datasource writes, if the same batch of data is ingested again, Hudi re-ingests it as though it were a new batch. This patch adds support for users who prefer to skip such duplicate batches. Two additional configs are introduced: `hoodie.datasource.write.writer.identifier` and `hoodie.datasource.write.batch.identifier`. The former uniquely identifies a writer; the latter identifies a batch and must be monotonically increasing across subsequent batches from the same writer. Users who want this feature need to set both configs; if Hudi then detects a duplicate batch being ingested, it skips re-ingesting it. If the configs are not set, every batch of data is ingested. Users can also leverage this with a streaming source if they write to Hudi via the `foreachBatch()` API. For users writing to Hudi via the StreamingSink, idempotency is handled automatically and no additional configs are required.

### Impact

Will help users ensure their Spark datasource writes to Hudi are idempotent.

### Risk level (write none, low medium or high below)

low.

### Documentation Update

Our configuration update should take care of it.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
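A minimal sketch of how the two new configs might be supplied on a Spark datasource write. The config keys are taken from this PR; the writer id (`"ingest-job-1"`), the table path, and the `should_ingest` helper are illustrative assumptions, not part of the patch:

```python
def hudi_write_options(writer_id: str, batch_id: int) -> dict:
    """Build the write options that enable duplicate-batch skipping
    for a given writer (config keys come from this PR)."""
    return {
        "hoodie.datasource.write.writer.identifier": writer_id,
        # Must be monotonically increasing for subsequent batches
        # from the same writer, per the PR description.
        "hoodie.datasource.write.batch.identifier": str(batch_id),
    }


def should_ingest(last_committed_batch_id, incoming_batch_id) -> bool:
    """Assumed skip semantics: a batch whose id is not strictly greater
    than the last committed id for the same writer is a duplicate."""
    if last_committed_batch_id is None:
        return True
    return incoming_batch_id > last_committed_batch_id


# In a Structured Streaming foreachBatch() callback, the epoch id is a
# natural monotonically increasing batch identifier (sketch only):
def write_batch(batch_df, epoch_id):
    (batch_df.write.format("hudi")
        .options(**hudi_write_options("ingest-job-1", epoch_id))
        .mode("append")
        .save("/tmp/hudi/target_table"))
```

Replaying the same `epoch_id` from the same writer would then be detected as a duplicate and skipped, rather than re-ingested.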
