myandpr opened a new pull request #3255: URL: https://github.com/apache/iceberg/pull/3255
## What problem does this PR solve?

When we write data into an Iceberg table using Flink, it produces a lot of small files. Merely compacting small files cannot reduce their number; only after compacting the small files and deleting expired snapshot files is the overall number of files actually reduced. This branch supports deleting expired snapshots automatically when writing data into an Iceberg table with Flink.

## How was this problem solved?

**We added the following configuration:**
```
// auto expire snapshots
snapshot.flink.auto-expire.enabled
snapshot.flink.auto-expire.interval-ms
snapshot.flink.auto-expire.max-snapshot-age-ms
snapshot.flink.auto-expire.min-snapshots-to-keep
snapshot.flink.auto-expire.snapshots-group-nums
```

In this PR, we add two operators (`ExpireSnapshotGenerator`, `ExpireSnapshotOperator`) after `IcebergFilesCommitter` to implement the expire-snapshots function.

**Expire snapshot function:**
1. In `IcebergFilesCommitter.notifyCheckpointComplete()`, we emit an `EndCheckpoint` message to the downstream operator `ExpireSnapshotGenerator`.
2. `ExpireSnapshotGenerator` generates all files (manifest list files, manifest files, data files) that need to be deleted and distributes them to the downstream operator `ExpireSnapshotOperator`.
3. `ExpireSnapshotOperator` then actually deletes all files received from upstream in parallel.
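To illustrate how the options above might be used, here is a hedged sketch of setting them as table properties. The option names come from this PR; the values are hypothetical examples, not defaults:

```
// hypothetical example values for the new options
snapshot.flink.auto-expire.enabled=true
snapshot.flink.auto-expire.interval-ms=3600000        // run expiration at most once per hour
snapshot.flink.auto-expire.max-snapshot-age-ms=86400000  // expire snapshots older than one day
snapshot.flink.auto-expire.min-snapshots-to-keep=10
snapshot.flink.auto-expire.snapshots-group-nums=5
```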
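Steps 2 and 3 above can be modeled as a small, self-contained sketch: a "generator" that computes which files are referenced only by expired snapshots (files still referenced by a retained snapshot must survive), and a "deleter" that removes them in parallel. This is a simplified illustration only; the `Snapshot` record, `filesToDelete`, and `deleteInParallel` below are hypothetical stand-ins, not the PR's actual classes, and real code would delete through Iceberg's `FileIO` rather than printing:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// Simplified stand-in for a snapshot: an id plus the files it references
// (manifest list, manifest, and data files, flattened into one set here).
record Snapshot(long id, Set<String> files) {}

public class ExpireSketch {

    // Step 2 (ExpireSnapshotGenerator, simplified): a file is safe to delete
    // only if no retained snapshot still references it.
    static Set<String> filesToDelete(List<Snapshot> expired, List<Snapshot> retained) {
        Set<String> live = retained.stream()
                .flatMap(s -> s.files().stream())
                .collect(Collectors.toSet());
        return expired.stream()
                .flatMap(s -> s.files().stream())
                .filter(f -> !live.contains(f))
                .collect(Collectors.toSet());
    }

    // Step 3 (ExpireSnapshotOperator, simplified): delete the files in parallel.
    static void deleteInParallel(Set<String> files, int parallelism) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (String f : files) {
            // A real operator would call FileIO.deleteFile(f) here.
            pool.submit(() -> System.out.println("deleted " + f));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Snapshot> expired = List.of(
                new Snapshot(1, Set.of("m-1.avro", "d-1.parquet", "d-2.parquet")));
        List<Snapshot> retained = List.of(
                new Snapshot(2, Set.of("m-2.avro", "d-2.parquet")));

        // d-2.parquet is still referenced by snapshot 2, so it is kept.
        Set<String> toDelete = filesToDelete(expired, retained);
        deleteInParallel(toDelete, 2);
    }
}
```

The key design point this models is that the generator must compute reachability across *retained* snapshots before emitting deletions, otherwise a parallel deleter could remove a data file that a live snapshot still references.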
