myandpr opened a new pull request #3213: URL: https://github.com/apache/iceberg/pull/3213
## What problem does this PR solve?

Writing data into an Iceberg table with Flink produces a lot of small files. This branch supports automatically compacting small files and deleting expired snapshots while Flink writes data into an Iceberg table.

## How was this problem solved?

**We added the following configuration**

```
// auto compact
write.auto-compact.enabled
write.compact.interval-ms
write.compact.target-file-size-bytes
write.compact.small-file-size-bytes (small file size threshold)
write.compact.small-file-nums (small file number threshold)

// auto expire snapshots
snapshot.auto-expire.enabled
snapshot.auto-expire.interval-ms
snapshot.auto-expire.max-snapshot-age-ms
snapshot.auto-expire.min-snapshots-to-keep
snapshot.auto-expire.snapshots-group-nums
```

In this PR, we add 3 operators (CompactFileGenerator, CompactFileOperator, CompactFileCommitter) for the compaction function and 2 operators (ExpireSnapshotGenerator, ExpireSnapshotOperator) for the expire snapshots function. All of them run after IcebergFilesCommitter.

**Compact function:**<br/>
1. In IcebergFilesCommitter.notifyCheckpointComplete(), we emit an 'EndCheckpoint' message to the downstream operator 'CompactFileGenerator'. <br/>
2. 'CompactFileGenerator' generates every 'CombinedScanTask' that needs to be rewritten and distributes them to the downstream operator 'CompactFileOperator'. <br/>
3. 'CompactFileOperator' starts compacting files when it receives a 'CombinedScanTask' from upstream and emits the compaction result to the downstream operator 'CompactFileCommitter'. <br/>
4. Once it has received all compaction results from upstream, 'CompactFileCommitter' commits a compaction transaction and emits an 'EndCheckpoint' message to the downstream operator 'ExpireSnapshotGenerator'.

**Expire snapshot function:**<br/>
1. 'ExpireSnapshotGenerator' generates the list of all files (manifest list files, manifest files, data files) that need to be deleted and distributes them to the downstream operator 'ExpireSnapshotOperator'. <br/>
2. 'ExpireSnapshotOperator' starts deleting all files it receives from upstream.
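
To make the compaction thresholds more concrete, here is a minimal, dependency-free sketch of how a planner such as 'CompactFileGenerator' might decide what to rewrite. It is not the code in this PR. The class `CompactionPlanSketch`, the method `plan`, and the greedy grouping are hypothetical, and the reading of `write.compact.small-file-size-bytes`, `write.compact.small-file-nums` and `write.compact.target-file-size-bytes` is an assumption based only on the property names above. File sizes stand in for the real 'CombinedScanTask' objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of threshold-based compaction planning; not the PR's classes. */
final class CompactionPlanSketch {

  /**
   * Returns groups of file sizes to rewrite together, or an empty list when
   * fewer than smallFileNums files are smaller than smallFileSizeBytes.
   */
  static List<List<Long>> plan(
      List<Long> fileSizes, long smallFileSizeBytes, int smallFileNums, long targetFileSizeBytes) {
    // Assumption: only files below write.compact.small-file-size-bytes are candidates.
    List<Long> small = new ArrayList<>();
    for (long size : fileSizes) {
      if (size < smallFileSizeBytes) {
        small.add(size);
      }
    }

    List<List<Long>> groups = new ArrayList<>();
    // Assumption: compaction is skipped until write.compact.small-file-nums candidates exist.
    if (small.size() < smallFileNums) {
      return groups;
    }

    // Greedy packing: close the current group once adding a file would exceed the target size.
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : small) {
      if (!current.isEmpty() && currentBytes + size > targetFileSizeBytes) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Sizes are in arbitrary units to keep the example readable.
    List<Long> sizes = List.of(10L, 200L, 30L, 40L, 25L);
    System.out.println(plan(sizes, 50L, 3, 64L));
  }
}
```

In this sketch each returned group plays the role of one unit of work handed to 'CompactFileOperator'. An empty result means the thresholds were not met and no rewrite is triggered for that checkpoint.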

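The snapshot expiry settings can be illustrated in the same spirit. The sketch below is hypothetical and is not the logic of 'ExpireSnapshotGenerator'. It assumes that `snapshot.auto-expire.max-snapshot-age-ms` defines an age cutoff and that `snapshot.auto-expire.min-snapshots-to-keep` protects the newest snapshots regardless of age, similar in spirit to combining `expireOlderThan` and `retainLast` on Iceberg's `ExpireSnapshots` API. The class and method names are invented for illustration, snapshots are represented only by their timestamps, and `snapshot.auto-expire.snapshots-group-nums` is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of choosing which snapshots to expire; not the PR's classes. */
final class SnapshotExpirySketch {

  /**
   * Returns the timestamps of snapshots to expire.
   *
   * @param snapshotTimestampsMs snapshot commit times in milliseconds, sorted oldest first
   */
  static List<Long> selectExpired(
      List<Long> snapshotTimestampsMs, long nowMs, long maxSnapshotAgeMs, int minSnapshotsToKeep) {
    // Assumption: snapshots committed before this instant are too old to keep.
    long cutoffMs = nowMs - maxSnapshotAgeMs;

    // Assumption: the newest minSnapshotsToKeep snapshots are never expired.
    int expirable = Math.max(0, snapshotTimestampsMs.size() - minSnapshotsToKeep);

    List<Long> expired = new ArrayList<>();
    for (int i = 0; i < expirable; i++) {
      long timestampMs = snapshotTimestampsMs.get(i);
      if (timestampMs < cutoffMs) {
        expired.add(timestampMs);
      }
    }
    return expired;
  }

  public static void main(String[] args) {
    List<Long> snapshots = List.of(1000L, 2000L, 3000L, 4000L, 5000L);
    // Cutoff is 6000 - 2500 = 3500, and the newest snapshot is always retained.
    System.out.println(selectExpired(snapshots, 6000L, 2500L, 1));
  }
}
```

Under these assumptions, the files reachable only from the selected snapshots would be the ones that 'ExpireSnapshotGenerator' hands to 'ExpireSnapshotOperator' for deletion.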