jasonf20 opened a new pull request, #9323:
URL: https://github.com/apache/iceberg/pull/9323
**Explanation**
Certain data production patterns produce many micro-batch updates that
must be applied to the table sequentially. If these batches include updates
(deletes plus inserts), each batch must be committed at its own data sequence
number, in batch order, so that the deletes of a batch apply only to data from
previous batches.
Currently, this is achievable by creating a transaction and committing each
batch as a separate row delta:
```
# One row delta (and therefore one manifest write) per batch.
for batch in batches:
    delta = transaction.newRowDelta()
    delta.add(batch.deletes)
    delta.add(batch.inserts)
    delta.commit()
transaction.commit()
```
However, this is very slow since it produces a manifest file for each batch
and writes that file out to the filesystem.
Instead, I propose an API that immediately produces a single manifest
containing files at different data sequence numbers (as you would get after
rewriting the manifests).
```
update = table.newStreamingUpdate()
for batch in batches:
    # Each call starts a new batch, advancing the data sequence number by 1.
    update.newBatch()
    update.add(batch.deleteFiles)
    update.add(batch.dataFiles)
update.commit()
```
The API will produce one delete-file manifest and one data-file manifest (or
more if they get too large), where each batch advances the data sequence
number by 1. This way:
* Deletes from previous batches don't apply to new data.
* Deletes do apply to all data written before the delete.
This PR adds this API.
I will add a sample benchmark in the first comment.
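
To illustrate the sequencing rule described above, here is a toy model in plain Python. It is only a sketch, not the implementation in this PR: `ToyStreamingUpdate`, `Entry`, and `delete_applies` are hypothetical names, and the rule that a delete applies only to data with a strictly lower data sequence number is an assumption chosen to match the two bullets above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    path: str
    kind: str      # "data" or "delete"
    data_seq: int  # data sequence number assigned to this file


@dataclass
class ToyStreamingUpdate:
    """Hypothetical stand-in for the proposed update; not the Iceberg API."""
    base_seq: int                                # sequence number before this update
    entries: list = field(default_factory=list)  # models the single manifest
    batch: int = 0

    def new_batch(self):
        # Each batch advances the data sequence number by 1.
        self.batch += 1

    def add(self, path, kind):
        self.entries.append(Entry(path, kind, self.base_seq + self.batch))


def delete_applies(delete, data):
    # Assumed rule: a delete affects only data written at a strictly lower
    # data sequence number, i.e. data from earlier batches.
    return (delete.kind == "delete" and data.kind == "data"
            and delete.data_seq > data.data_seq)


update = ToyStreamingUpdate(base_seq=10)
for i in range(3):
    update.new_batch()
    update.add(f"delete-{i}", "delete")
    update.add(f"data-{i}", "data")

# Print which data files each delete file would apply to.
for d in (e for e in update.entries if e.kind == "delete"):
    hits = [e.path for e in update.entries if delete_applies(d, e)]
    print(d.path, "->", hits)
```

In this model all entries end up in one list (the single manifest), yet the deletes of batch 1 reach only the data of batch 0, and no delete reaches data from its own or a later batch.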