myandpr opened a new pull request #3255:
URL: https://github.com/apache/iceberg/pull/3255


   ## What problem does this PR solve?
   When we write data into an Iceberg table using Flink, it produces a lot of small files.
   Merely compacting the small files cannot reduce the total file count, because the pre-compaction files are still referenced by earlier snapshots.
   Only after compacting the small files and then deleting expired snapshot files can we really reduce the number of small files overall.
   
   This branch supports automatically deleting expired snapshots while writing data into an Iceberg table from Flink.
   
   ## How was this problem solved?
   **We added the following configuration options:**
   ```
   // auto expire snapshots
   snapshot.flink.auto-expire.enabled
   snapshot.flink.auto-expire.interval-ms
   snapshot.flink.auto-expire.max-snapshot-age-ms
   snapshot.flink.auto-expire.min-snapshots-to-keep
   snapshot.flink.auto-expire.snapshots-group-nums
   ```
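
   As a sketch, these options could be collected into the table's property map before the Flink job is configured. The property keys below come from this PR; the values and the helper class are purely illustrative:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class AutoExpireConfig {
       // Hypothetical helper: builds the auto-expire properties introduced
       // by this PR with illustrative (not default) values.
       public static Map<String, String> exampleProperties() {
           Map<String, String> props = new HashMap<>();
           props.put("snapshot.flink.auto-expire.enabled", "true");
           // Trigger the expire check every 10 minutes.
           props.put("snapshot.flink.auto-expire.interval-ms", "600000");
           // Expire snapshots older than 1 hour.
           props.put("snapshot.flink.auto-expire.max-snapshot-age-ms", "3600000");
           // Always retain at least 5 snapshots regardless of age.
           props.put("snapshot.flink.auto-expire.min-snapshots-to-keep", "5");
           props.put("snapshot.flink.auto-expire.snapshots-group-nums", "10");
           return props;
       }
   }
   ```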
   
   In this PR, we add two operators (ExpireSnapshotGenerator and ExpireSnapshotOperator) that implement the expire-snapshots function downstream of IcebergFilesCommitter.
   
   
   **Expire-snapshot flow:**<br/>
   1. In IcebergFilesCommitter.notifyCheckpointComplete(), we emit an 'EndCheckpoint' message to the downstream operator 'ExpireSnapshotGenerator'. <br/>
   2. 'ExpireSnapshotGenerator' generates the list of all files (manifest list files, manifest files, data files) that need to be deleted and distributes them to the downstream operator 'ExpireSnapshotOperator'. <br/>
   3. 'ExpireSnapshotOperator' then actually deletes all files received from upstream, in parallel.
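
   The selection logic in steps 2 and 3 can be sketched without the Flink plumbing. The operator names are from this PR; the Snapshot stand-in class, the method names, and the retention rule shown (keep the newest min-snapshots-to-keep, expire anything older than max-snapshot-age-ms) are a hypothetical simplification of what the generator would compute:

   ```java
   import java.util.ArrayList;
   import java.util.Comparator;
   import java.util.List;

   public class ExpireSnapshotSketch {
       // Minimal stand-in for an Iceberg snapshot: when it was committed and
       // which files (manifest list, manifests, data files) only it references.
       static class Snapshot {
           final long timestampMs;
           final List<String> ownedFiles;
           Snapshot(long timestampMs, List<String> ownedFiles) {
               this.timestampMs = timestampMs;
               this.ownedFiles = ownedFiles;
           }
       }

       // Step 2 sketch (ExpireSnapshotGenerator): choose snapshots older than
       // maxAgeMs, but always keep the newest minToKeep, and collect the files
       // they exclusively own for deletion.
       static List<String> filesToDelete(List<Snapshot> snapshots, long maxAgeMs,
                                         int minToKeep, long nowMs) {
           List<Snapshot> sorted = new ArrayList<>(snapshots);
           sorted.sort(Comparator.comparingLong(s -> s.timestampMs)); // oldest first
           List<String> doomed = new ArrayList<>();
           int deletable = sorted.size() - minToKeep;
           for (int i = 0; i < deletable; i++) {
               Snapshot s = sorted.get(i);
               if (nowMs - s.timestampMs > maxAgeMs) {
                   doomed.addAll(s.ownedFiles);
               }
           }
           return doomed;
       }

       // Step 3 sketch (ExpireSnapshotOperator): delete the received files in
       // parallel; real deletion via the table's FileIO is simulated here.
       static void deleteAll(List<String> files) {
           files.parallelStream().forEach(f -> System.out.println("deleting " + f));
       }
   }
   ```

   Splitting generation (single-threaded, reads table metadata once) from deletion (parallel, purely file I/O) is what lets the delete work scale out across operator subtasks.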
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


