Hi Chen,

The "replace" operation indicates that although the files in a table changed, the actual table data did not. Queries should produce the same results, if they are deterministic. That's why we use it for file compaction: although we replace small files with fewer, larger files, the overall contents don't change. To your question, yes, that's what the RewriteFiles API does.
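As a rough illustration, here is a toy Python sketch of that idea. It is not the Iceberg API: DataFile, table_rows, and compact are names made up for this example, and a table state is modeled as nothing more than a set of files. It only shows that a replace swaps which files make up the table while a query over them sees the same rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file: a path plus the rows it holds."""
    path: str
    rows: tuple


def table_rows(files):
    """Return every row a full scan over the given set of files would see."""
    return sorted(row for f in files for row in f.rows)


def compact(files, small_files, new_path):
    """Return a new file set where small_files are replaced by one merged file."""
    ordered = sorted(small_files, key=lambda f: f.path)
    merged = DataFile(new_path, tuple(row for f in ordered for row in f.rows))
    return (files - small_files) | {merged}


small_a = DataFile("a.parquet", (1, 2))
small_b = DataFile("b.parquet", (3,))
other = DataFile("c.parquet", (4, 5, 6))

# State before compaction: three files.
before = frozenset({small_a, small_b, other})

# State after the "replace": two small files swapped for one larger file.
after = compact(before, frozenset({small_a, small_b}), "ab-compacted.parquet")

assert before != after                            # the files changed
assert table_rows(before) == table_rows(after)    # the table data did not
```

Note that compact returns a new set and leaves the old one untouched, so the state before compaction can still be read as it was.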

All operations change the current table state. Each snapshot of a table is a complete set of the data files that make up the table, and snapshots are immutable. So you can't go back and change a snapshot from yesterday. What you can do is replace small files in the current state with a compacted large file. That creates a new snapshot that is used from then on.

The small files are still referenced and available as long as the old snapshot exists, which is why snapshots should be cleaned up regularly with ExpireSnapshots. That will delete files that are no longer referenced by any remaining snapshot.

We keep files referenced by old snapshots around for a couple of reasons. First, readers that started with a different current snapshot may still be reading them. Second, it allows you to go back and read the table at an older point in time (time-travel queries).

I hope that helps,

rb

On Fri, Jun 26, 2020 at 10:39 AM Chen Song <chen.song...@gmail.com> wrote:

> Hey
>
> In Iceberg documentation, it mentions to use this for compaction
> <https://iceberg.apache.org/spec/#snapshots>. I have a few questions on
> compaction.
>
> Is this (replace) referring to this RewriteFiles API
> <https://iceberg.apache.org/javadoc/master/org/apache/iceberg/RewriteFiles.html>
> ?
> If so, it looks like it only applies to the most recent snapshot of data?
> Is there a way to compact data belonging to old snapshots? e.g., if I want
> to rewrite data for older data with newer partition spec?
>
> Thanks for the help in advance.
>
> --
> Chen Song
>
>

--
Ryan Blue
Software Engineer
Netflix