nsivabalan opened a new pull request, #8570:
URL: https://github.com/apache/hudi/pull/8570

   ### Change Logs
   
   With Spark datasource writes, if the same batch of data is ingested again, Hudi will re-ingest it as though it were a new batch. This patch adds support for skipping such duplicate batches. Two new configs are introduced: `hoodie.datasource.write.writer.identifier` and `hoodie.datasource.write.batch.identifier`. The writer identifier uniquely identifies a writer, and the batch identifier identifies a batch of data; it must be monotonically increasing across subsequent batches from the same writer. Users who want to leverage this feature need to set both configs. If Hudi then detects that a batch has already been ingested, it skips re-ingesting it. If the configs are not set, every batch of data is ingested as before.
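   The skip decision described above can be sketched roughly as follows. This is a plain-Python illustration of the idea, not Hudi's actual implementation; the function name and the `last_committed` state are hypothetical:

```python
def should_skip_batch(last_committed, writer_id, batch_id):
    """Decide whether an incoming batch should be skipped.

    last_committed maps writer identifier -> the last batch identifier
    committed by that writer (both values come from the two new configs).
    Batch identifiers are expected to be monotonically increasing per
    writer, so anything at or below the last committed id is a duplicate.
    """
    if writer_id is None or batch_id is None:
        # Feature disabled: without both configs, every batch is ingested.
        return False
    prev = last_committed.get(writer_id)
    return prev is not None and batch_id <= prev
```

   Note that the check is scoped per writer: the same batch identifier from a different writer identifier is still ingested.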
   Users can also leverage this with a streaming source if they are writing to Hudi via the `foreachBatch()` API. For users writing to Hudi through the streaming sink, idempotency is handled automatically and no additional configs need to be set.
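   In the `foreachBatch` case, the `batchId` argument Spark passes to the callback is already monotonically increasing, so it maps directly onto the batch-identifier contract. A minimal sketch of wiring the two configs per micro-batch; the helper name and the commented callback are illustrative, not part of this patch:

```python
def idempotency_options(writer_id, batch_id):
    """Build the per-micro-batch Hudi write options for idempotent writes.

    Spark Structured Streaming's foreachBatch callback receives a
    monotonically increasing batchId, which satisfies the batch-identifier
    requirement without any extra bookkeeping by the user.
    """
    return {
        "hoodie.datasource.write.writer.identifier": writer_id,
        "hoodie.datasource.write.batch.identifier": str(batch_id),
    }

# Inside the foreachBatch callback one would then do, e.g.:
# def write_batch(df, batch_id):
#     (df.write.format("hudi")
#        .options(**idempotency_options("streaming-writer-1", batch_id))
#        .mode("append")
#        .save(base_path))
```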
   
   ### Impact
   
   Helps users ensure that their Spark datasource writes to Hudi are idempotent.
   
   ### Risk level (write none, low medium or high below)
   
   low.
   
   ### Documentation Update
   
   Our configuration update should take care of it.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

