myandpr opened a new pull request #3213:
URL: https://github.com/apache/iceberg/pull/3213


   ## What problem does this PR solve?
   When we write data into an Iceberg table using Flink, the job produces a lot of 
small files. 
   
   This branch adds automatic compaction of small files and automatic deletion of 
expired snapshots when Flink writes data into an Iceberg table. 
   
   ## How was this problem solved?
   **We added the following configuration**
   ```
   // auto compact
   write.auto-compact.enabled
   write.compact.interval-ms
   write.compact.target-file-size-bytes
   write.compact.small-file-size-bytes   (small file size threshold)
   write.compact.small-file-nums          (small file number threshold)
   
   // auto expire snapshots
   snapshot.auto-expire.enabled
   snapshot.auto-expire.interval-ms
   snapshot.auto-expire.max-snapshot-age-ms
   snapshot.auto-expire.min-snapshots-to-keep
   snapshot.auto-expire.snapshots-group-nums
   ```
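
The PR text does not spell out how the two small-file thresholds interact. As a rough sketch, assuming a file counts as "small" when it is below `write.compact.small-file-size-bytes` and compaction is triggered once the number of such files reaches `write.compact.small-file-nums`, the check could look like the following. `SmallFileCheck` and `shouldCompact` are hypothetical names used only for illustration; they are not taken from the PR.

```java
import java.util.List;

// Hypothetical helper, not part of the PR: it only illustrates one plausible
// reading of the two small-file thresholds.
final class SmallFileCheck {
    private SmallFileCheck() {}

    /**
     * Returns true when the number of files smaller than smallFileSizeBytes
     * reaches smallFileNums (assumed semantics).
     */
    static boolean shouldCompact(List<Long> fileSizesBytes,
                                 long smallFileSizeBytes,
                                 int smallFileNums) {
        long smallFiles = fileSizesBytes.stream()
            .filter(size -> size < smallFileSizeBytes)
            .count();
        return smallFiles >= smallFileNums;
    }
}
```

Under this reading, a table with many files just above the size threshold would never trigger compaction, so the two values would need to be tuned together.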
   
   In this PR, we add three operators (CompactFileGenerator, CompactFileOperator, 
CompactFileCommitter) for compaction and two operators 
(ExpireSnapshotGenerator, ExpireSnapshotOperator) for snapshot expiration, all 
placed downstream of IcebergFilesCommitter.
   
   **Compact function:**<br/>
   1. In IcebergFilesCommitter.notifyCheckpointComplete(), we emit an 
'EndCheckpoint' message to the downstream operator 'CompactFileGenerator'. <br/>
   2. 'CompactFileGenerator' generates every 'CombinedScanTask' that needs to be 
rewritten and distributes them to the downstream operator 'CompactFileOperator'. 
<br/>
   3. 'CompactFileOperator' starts compacting files when it receives a 
'CombinedScanTask' from upstream and emits the compaction result to the 
downstream operator 'CompactFileCommitter'. <br/>
   4. After receiving all compaction results from upstream, 'CompactFileCommitter' 
commits a compaction transaction and emits an 'EndCheckpoint' message to the 
downstream operator 'ExpireSnapshotGenerator'.
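
The compaction steps above can be pictured with a small in-memory sketch. This is not the PR's code and uses no Flink or Iceberg API: `CompactionSketch`, `planTasks`, `compact`, and `Committer` are hypothetical stand-ins for the generator, operator, and committer stages, files are reduced to their sizes, and the committer is assumed to know in advance how many results to expect.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified in-memory stand-in for the three compaction stages.
// All names here are hypothetical and chosen for illustration only.
final class CompactionSketch {
    private CompactionSketch() {}

    /** Generator stage: pack file sizes into groups of at most targetBytes where possible. */
    static List<List<Long>> planTasks(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> tasks = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group when adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                tasks.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            tasks.add(current);
        }
        return tasks;
    }

    /** Operator stage: "rewrite" one task into a single output file, returning its size. */
    static long compact(List<Long> task) {
        return task.stream().mapToLong(Long::longValue).sum();
    }

    /** Committer stage: buffers results and commits only once all have arrived. */
    static final class Committer {
        private final int expectedResults;
        private final List<Long> pending = new ArrayList<>();
        private boolean committed = false;

        Committer(int expectedResults) {
            this.expectedResults = expectedResults;
        }

        /** Returns true if this result completed the batch and triggered the commit. */
        boolean onResult(long compactedFileSize) {
            pending.add(compactedFileSize);
            if (!committed && pending.size() == expectedResults) {
                committed = true; // a real committer would commit the rewrite transaction here
                return true;
            }
            return false;
        }

        /** Files visible after the commit; empty while results are still outstanding. */
        List<Long> committedFiles() {
            return committed ? List.copyOf(pending) : List.of();
        }
    }
}
```

Committing only after the last result arrives keeps the rewrite atomic in this sketch: either every planned group is replaced or none is.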
   
   **Expire snapshot function:**<br/>
   1. 'ExpireSnapshotGenerator' generates the list of all files (manifest list 
files, manifest files, data files) that need to be deleted and distributes them 
to the downstream operator 'ExpireSnapshotOperator'. <br/>
   2. 'ExpireSnapshotOperator' deletes all files it receives from upstream.
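
How the generator picks snapshots to expire is not described here. A minimal sketch, assuming `snapshot.auto-expire.max-snapshot-age-ms` sets an age cutoff and `snapshot.auto-expire.min-snapshots-to-keep` protects the newest snapshots regardless of age, might look like this. `ExpireSketch` and `selectExpired` are hypothetical names, snapshots are reduced to their commit timestamps, and `snapshot.auto-expire.snapshots-group-nums` is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical selection logic, not taken from the PR.
final class ExpireSketch {
    private ExpireSketch() {}

    /**
     * Returns the timestamps of snapshots to expire, newest first: those older
     * than (nowMs - maxSnapshotAgeMs), excluding the newest minSnapshotsToKeep.
     */
    static List<Long> selectExpired(List<Long> snapshotTimestampsMs,
                                    long nowMs,
                                    long maxSnapshotAgeMs,
                                    int minSnapshotsToKeep) {
        List<Long> newestFirst = new ArrayList<>(snapshotTimestampsMs);
        newestFirst.sort(Comparator.reverseOrder());

        long cutoffMs = nowMs - maxSnapshotAgeMs;
        List<Long> expired = new ArrayList<>();
        // Skip the newest snapshots that must always be retained.
        for (int i = Math.max(0, minSnapshotsToKeep); i < newestFirst.size(); i++) {
            long timestampMs = newestFirst.get(i);
            if (timestampMs < cutoffMs) {
                expired.add(timestampMs);
            }
        }
        return expired;
    }
}
```

The files to delete would then be those referenced only by the selected snapshots, which is the part the sketch leaves out.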
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


