Reo-LEI opened a new pull request #3323: URL: https://github.com/apache/iceberg/pull/3323
### Description

This PR is based on the approach in @rdblue's [comment](https://github.com/apache/iceberg/pull/2867#issuecomment-891978223) and tries to incrementally rewrite committed data files.

### Implementation

In this PR, the parallel `IcebergStreamRewriter` and the single-parallelism `IcebergRewriteFilesCommitter` are appended after `IcebergFilesCommitter`.

```
                              +-> IcebergStreamRewriter --+
IcebergFilesCommitter --Hash--+-> IcebergStreamRewriter --+--Rebalance--> IcebergRewriteFilesCommitter
                              +-> IcebergStreamRewriter --+
```

- First, the committed data files and their related delete files are grouped by partition into `CommitResult`s and distributed to the `IcebergStreamRewriter` instances according to partition.
- `IcebergStreamRewriter` collects each `CommitResult`, writes a temporary `DeltaManifests` that references all committed data/delete files, and appends this temporary manifest to a `DataFileGroup`. Once a file group reaches a rewrite condition (file count / file size), the rewrite is executed and a `RewriteResult` is emitted to `IcebergRewriteFilesCommitter`.
- `IcebergRewriteFilesCommitter` collects all `RewriteResult`s and commits them serially (stream/batch).

### Note

- If the rewrite of a partition's file group fails, the file group is not dropped and the Flink job is not stopped. `IcebergStreamRewriter` retries the rewrite of this file group when new data files are appended to it.
- `IcebergRewriteFilesCommitter` tries to commit all rewrite results of the previous checkpoint (ckpt-1) before the current checkpoint, committing all remaining rewrite results in `prepareSnapshotPreBarrier`. This prevents the case where rewritten files are modified at the checkpoint while their rewrite result has not been committed yet.

### Configuration

- `flink.rewrite.max-files-count`
- `flink.rewrite.target-file-size-bytes`
- `flink.rewrite.commit-groups-size`

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
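
As an illustration of the rewrite condition and the retry behaviour described in the Implementation and Note sections of the PR description, here is a minimal Java sketch. It is not the PR's code: `FileGroupSketch` and all of its methods are hypothetical names. It also assumes that `flink.rewrite.max-files-count` and `flink.rewrite.target-file-size-bytes` act as "rewrite once either threshold is reached" limits, which the description implies but does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the PR's DataFileGroup; not the actual class.
class FileGroupSketch {
  // Assumed meaning of flink.rewrite.max-files-count.
  private final int maxFilesCount;
  // Assumed meaning of flink.rewrite.target-file-size-bytes.
  private final long targetFileSizeBytes;

  private final List<Long> fileSizes = new ArrayList<>();
  private long totalBytes = 0L;

  FileGroupSketch(int maxFilesCount, long targetFileSizeBytes) {
    this.maxFilesCount = maxFilesCount;
    this.targetFileSizeBytes = targetFileSizeBytes;
  }

  // Records one committed data file and reports whether the group
  // has now reached a rewrite condition.
  boolean add(long fileSizeBytes) {
    fileSizes.add(fileSizeBytes);
    totalBytes += fileSizeBytes;
    return shouldRewrite();
  }

  // Assumption: either threshold alone is enough to trigger a rewrite.
  boolean shouldRewrite() {
    return fileSizes.size() >= maxFilesCount || totalBytes >= targetFileSizeBytes;
  }

  // The group is cleared only after a successful rewrite. After a failure
  // the files stay in the group, so the next appended file triggers a retry.
  void onRewriteFinished(boolean succeeded) {
    if (succeeded) {
      fileSizes.clear();
      totalBytes = 0L;
    }
  }

  int fileCount() {
    return fileSizes.size();
  }
}
```

With thresholds of 3 files or 1000 bytes, the third small file triggers a rewrite. If that rewrite fails the three files are kept, and the next `add` call reports the condition again.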
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
