Reo-LEI opened a new pull request #3323:
URL: https://github.com/apache/iceberg/pull/3323


   ### Description
   This PR is based on the approach in @rdblue's 
[comment](https://github.com/apache/iceberg/pull/2867#issuecomment-891978223) 
and tries to incrementally rewrite committed data files.
   
   ### Implementation
   In this PR, a parallel `IcebergStreamRewriter` and a single-parallelism 
`IcebergRewriteFilesCommitter` are appended after `IcebergFilesCommitter`. 
   ```
                                 +-> IcebergStreamRewriter --+ 
   IcebergFilesCommitter --Hash--+-> IcebergStreamRewriter --+--Rebalance--> IcebergRewriteFilesCommitter
                                 +-> IcebergStreamRewriter --+ 
   ```
   - First, the committed data files and their related delete files are 
grouped by partition as `CommitResult` records and distributed to the 
`IcebergStreamRewriter` instances according to partition. 
   - `IcebergStreamRewriter` collects each `CommitResult`, writes a temporary 
`DeltaManifests` that references all committed data/delete files, and appends 
this temporary manifest to a `DataFileGroup`. Once a file group reaches a 
rewrite condition (file count / file size), the rewrite is executed and a 
`RewriteResult` is emitted to `IcebergRewriteFilesCommitter`.
   - `IcebergRewriteFilesCommitter` collects all `RewriteResult`s and 
commits them serially, as a stream or in batches.  
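
   The grouping step in the second bullet can be illustrated with a minimal 
sketch. `FileGroupSketch` and its methods are hypothetical names, not classes 
from this PR, and the sketch assumes that reaching either threshold (file count 
or accumulated size) is enough to trigger a rewrite; the actual condition in 
the PR may differ.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   /** Hypothetical per-partition file group; uses no Iceberg or Flink APIs. */
   class FileGroupSketch {
       private final int maxFilesCount;         // stands in for flink.rewrite.max-files-count
       private final long targetFileSizeBytes;  // stands in for flink.rewrite.target-file-size-bytes
       private final List<Long> pendingFileSizes = new ArrayList<>();
       private long pendingBytes = 0L;

       FileGroupSketch(int maxFilesCount, long targetFileSizeBytes) {
           this.maxFilesCount = maxFilesCount;
           this.targetFileSizeBytes = targetFileSizeBytes;
       }

       /** Records one committed data file (by size) in this group. */
       void append(long fileSizeBytes) {
           pendingFileSizes.add(fileSizeBytes);
           pendingBytes += fileSizeBytes;
       }

       /** Assumed rule: either threshold alone triggers a rewrite. */
       boolean shouldRewrite() {
           return pendingFileSizes.size() >= maxFilesCount
               || pendingBytes >= targetFileSizeBytes;
       }

       /** Simulates a rewrite: returns the number of files compacted and resets the group. */
       int rewrite() {
           int rewritten = pendingFileSizes.size();
           pendingFileSizes.clear();
           pendingBytes = 0L;
           return rewritten;
       }
   }
   ```

   In this sketch, a group created with `new FileGroupSketch(3, 1000L)` becomes 
eligible for a rewrite after three small files, or after a single file of 1000 
bytes or more.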
   
   
   ### Note
   - If rewriting the file group of a partition fails, the file group is not 
dropped and the Flink job is not stopped. `IcebergStreamRewriter` retries the 
rewrite of this file group when a new data file is appended to the group.
   - `IcebergRewriteFilesCommitter` tries to commit all rewrite results from 
the previous checkpoint (ckpt-1) before the current ckpt (it commits all 
remaining rewrite results in `prepareSnapshotPreBarrier`), to prevent the 
rewritten files from being modified at the ckpt while their rewrite results are 
not yet committed.
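
   The flush-before-checkpoint behavior in the second note can be sketched as 
follows. `RewriteCommitterSketch` is a hypothetical stand-in that uses no Flink 
or Iceberg APIs; results are plain strings, and the batching rule tied to 
`commitGroupsSize` is an assumption about how `flink.rewrite.commit-groups-size` 
might be used, not something stated in this PR.

   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;

   /** Hypothetical stand-in for the rewrite committer. */
   class RewriteCommitterSketch {
       private final int commitGroupsSize;  // assumed meaning: groups per batch commit
       private final Deque<String> pending = new ArrayDeque<>();
       private final List<String> committed = new ArrayList<>();

       RewriteCommitterSketch(int commitGroupsSize) {
           this.commitGroupsSize = commitGroupsSize;
       }

       /** Buffers a rewrite result; commits once enough results have accumulated. */
       void onRewriteResult(String result) {
           pending.add(result);
           if (pending.size() >= commitGroupsSize) {
               commitPending();
           }
       }

       /** Plays the role of prepareSnapshotPreBarrier: flush everything before the barrier. */
       void prepareSnapshotPreBarrier() {
           commitPending();
       }

       /** Commits buffered results one after another, in arrival order. */
       private void commitPending() {
           while (!pending.isEmpty()) {
               committed.add(pending.poll());
           }
       }

       List<String> committed() {
           return committed;
       }

       int pendingCount() {
           return pending.size();
       }
   }
   ```

   The point of the sketch is that no rewrite result stays buffered across a 
checkpoint: whatever has not reached the batch size is still committed when 
`prepareSnapshotPreBarrier()` runs.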
   
   ### Configuration
   - `flink.rewrite.max-files-count` 
   - `flink.rewrite.target-file-size-bytes`
   - `flink.rewrite.commit-groups-size` 
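
   As a rough illustration of how these keys might be read, the sketch below 
assumes they arrive as string key/value properties. `RewriteConfigSketch` and 
`longProperty` are hypothetical names, and the fallback value is supplied by 
the caller because this PR description does not state any defaults.

   ```java
   import java.util.Map;

   /** Hypothetical helper for reading the rewrite options from string properties. */
   class RewriteConfigSketch {
       static final String MAX_FILES_COUNT = "flink.rewrite.max-files-count";
       static final String TARGET_FILE_SIZE_BYTES = "flink.rewrite.target-file-size-bytes";
       static final String COMMIT_GROUPS_SIZE = "flink.rewrite.commit-groups-size";

       /** Returns the numeric value for key, or the caller-supplied fallback when the key is absent. */
       static long longProperty(Map<String, String> props, String key, long fallback) {
           String raw = props.get(key);
           return raw == null ? fallback : Long.parseLong(raw.trim());
       }
   }
   ```

   For example, with a map containing only `flink.rewrite.max-files-count` set 
to `"10"`, `longProperty(props, MAX_FILES_COUNT, 1L)` yields 10, while a lookup 
of `COMMIT_GROUPS_SIZE` yields whatever fallback the caller passes.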


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


