Hi Gabor,

I agree that with the use of table and partition statistics (and possibly other auxiliary files in the future), this problem of orphan files created by recomputing existing statistics (replacing existing files without deleting them) will grow. I also agree that while remove_orphan_files would delete such files, it would be good to have some other mechanism for cleaning them up.

Your proposal is interesting and I think it is feasible. It would require a spec change. Can we introduce a change for this in v3? If so, for handling the existing cases of table statistics and partition statistics, I'd suggest introducing two fields in table metadata, `orphan-statistics` and `orphan-partition-statistics`, holding a list of table statistics and a list of partition statistics respectively. If we want to be more general, maybe we can have a single `orphan-files` field instead, which would also be a list.

The (table) `statistics` and `partition-statistics` structs already contain `snapshot-id` fields, so I don't think we need a map of snapshot-id to file. Even for future use cases where a map keyed by snapshot-id might seem useful, you are already assuming the files correspond to snapshots, so it would make sense for the struct representing them to carry a snapshot-id anyway.
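To make this concrete, here is a rough sketch of what the relevant part of a metadata.json could look like after the statistics for a snapshot have been recomputed once. The `orphan-statistics` field and its exact shape are illustrative only and would have to be settled in the spec change; the paths and ids are made up:

  "statistics": [
    {
      "snapshot-id": 3055729675574597004,
      "statistics-path": "s3://bucket/db/table/metadata/stats-v2.puffin"
    }
  ],
  "orphan-statistics": [
    {
      "snapshot-id": 3055729675574597004,
      "statistics-path": "s3://bucket/db/table/metadata/stats-v1.puffin"
    }
  ]

A more general `orphan-files` list would look much the same, with an extra field describing the kind of file (in the spirit of the FileType in your proposal below).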
When table statistics or partition statistics for a snapshot are updated and there are already statistics for that snapshot, the existing file would be written into the corresponding `orphan-*` list.

I don't think we need the mechanism of 'write.statistics.delete-after-commit.enabled' and 'write.statistics.previous-versions-max'. If we require the orphan files to be cleaned up during snapshot expiration (the list trimmed in the metadata and the files deleted), that might be enough, provided snapshot expiration is run frequently enough.

If we want an additional/optional way to clean up these orphan files: when table statistics or partition statistics for a snapshot are updated, then in addition to writing the existing file into the `orphan-*` list, any file already in the `orphan-*` list for the same snapshot could be deleted and removed from the list as well. Note that any file that already appears as orphan in the current metadata.json is safe to remove. (We would still need snapshot expiration to remove the orphan files recorded for old snapshots, but this would potentially keep the lists shorter.) However, I think this is extra.
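To illustrate the bookkeeping, here is a minimal sketch (made-up names and shapes, not the actual Iceberg API and not a spec proposal) of how the update path and the expiration-time trimming could behave:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch only; all names here are invented for illustration.
class OrphanStatisticsSketch {

  // Stand-in for an entry of `statistics` / the proposed `orphan-statistics`.
  record StatsFile(long snapshotId, String path) {}

  private final Map<Long, StatsFile> statistics = new HashMap<>();
  private final List<StatsFile> orphanStatistics = new ArrayList<>();
  private final Consumer<String> deleteFile; // e.g. backed by FileIO in practice

  OrphanStatisticsSketch(Consumer<String> deleteFile) {
    this.deleteFile = deleteFile;
  }

  // Recompute/replace the statistics for a snapshot.
  void setStatistics(StatsFile updated) {
    StatsFile replaced = statistics.put(updated.snapshotId(), updated);
    if (replaced == null) {
      return; // first statistics for this snapshot, nothing becomes orphaned
    }
    // Optional eager cleanup: an older orphan for the same snapshot already
    // appears as orphan in the current metadata, so it is safe to delete now.
    orphanStatistics.removeIf(orphan -> {
      if (orphan.snapshotId() == updated.snapshotId()) {
        deleteFile.accept(orphan.path());
        return true;
      }
      return false;
    });
    // The replaced file becomes the newest orphan entry for this snapshot.
    orphanStatistics.add(replaced);
  }

  // During snapshot expiration: delete the orphan files recorded for expired
  // snapshots and trim the list accordingly.
  void expireOrphans(Set<Long> expiredSnapshotIds) {
    orphanStatistics.removeIf(orphan -> {
      if (expiredSnapshotIds.contains(orphan.snapshotId())) {
        deleteFile.accept(orphan.path());
        return true;
      }
      return false;
    });
  }
}

Note that both the optional eager cleanup and the expiration-time trimming operate only on what is recorded in the current metadata, so neither requires iterating over historical metadata.json files.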
What do folks think?

- Wing Yew

ps. I also noticed that the `statistics` and `partition-statistics` fields in table metadata are lists, with the unwritten expectation (that is to say, not written into the spec) that for each snapshot there is at most one file in the list. I also considered the idea of simply adding updated statistics to the list without removing the existing ones (this would be allowed by the spec), ensuring that the first (or last) entry for a snapshot is the latest one and thus the one to use. That way we wouldn't need a spec change, but much of the existing implementation would have to change, and I think it is too complicated anyway.

On Thu, Feb 27, 2025 at 5:31 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Thanks for the discussion on this topic during the community sync!
> Let me sum up what we discussed and also follow up with some additional
> thoughts.
>
> *Summary:*
> As long as the table is there, users can run orphan file cleanup to
> remove the orphaned stat files.
> If you drop the table, the orphaned stat files will unfortunately remain
> on disk. This is, however, a catalog matter of location ownership for
> the table.
>
> *My follow-up thoughts:*
> - Orphan file cleanup is not always feasible, e.g. when tables share
> locations.
> - Orphan files are expected when something goes wrong. With stat files,
> now even successful queries can create orphan files.
> - Over time there seem to be more and more ways of creating orphan files
> even in the successful case: table stats, (soon coming) partition stats,
> and who knows what else (column stats? indexes?). The situation might
> not be that severe now but could get worse over time.
> - Users seem to complain even about the /data and /metadata folders
> remaining on storage after a drop table. Remaining stat files could be
> another source of recurring complaints.
>
> I think orphan file removal (where feasible) can treat the symptom here,
> but the table format should also offer a way to take care of the root
> cause (unreferencing files when updating them).
>
> *Proposal:*
> What I have in mind to tackle the root cause is to keep track of not
> only the current stat files but also the historical ones in the table
> metadata. This would certainly increase the size of the metadata, but it
> could be kept at a manageable size with a mechanism similar to what we
> do with the historical metadata.jsons. In practice we could have flags
> like:
> 'write.statistics.delete-after-commit.enabled'
> 'write.statistics.previous-versions-max'
>
> Or if we want to make this even more generic, we could make these new
> flags control not just stat files but everything else we come up with
> (e.g. column stats, indexes, etc.). The naming could probably be better,
> but something like:
> 'write.historical-files.delete-after-commit.enabled'
> 'write.historical-files.previous-versions-max'
>
> With this we could:
> - Keep a mapping of 'SnapshotID to historical files' in the table
> metadata
> - Delete historical (stat) file(s) when updating the current one if we
> exceed the threshold history length
> - Delete the historical (stat) files when expiring snapshots
> - Delete the historical (stat) files when dropping the table
> - Take care of all kinds of files that are assigned 'offline' to a
> snapshot (like stats now) if we choose a generic enough representation,
> like:
> Map<SnapshotID, Map<FileType, List<Path>>>
> (where FileType could be 'TABLE_STATS', 'PARTITION_STATS', etc.)
> - Take care of historical metadata.json paths too; however, that would
> change the existing layout of the metadata.json files, so we most
> probably don't want to do that.
>
> Let me know what you think!
> Regards,
> Gabor
>
>
> On Mon, Feb 24, 2025 at 10:30 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi All,
>>
>> `remove_orphan_files` for sure drops the previous stat files, but if
>> you drop the table they will remain on disk forever. I don't have an
>> answer here (I reviewed the above-mentioned PR and raised these
>> concerns there), but I think we should figure out a way to avoid
>> accumulating unreferenced stat files. With this we have introduced
>> another easy way of creating orphan files, and we can't always rely on
>> users running `remove_orphan_files` (maybe multiple tables share the
>> same location, or users don't have an automated process for removing
>> orphaned files?). And the introduction of partition stats could make
>> the situation even worse.
>>
>> I'm thinking of an improvement where we drop the previous stat file
>> (if there is any) when updating the table with a new one, but in
>> theory that could cause issues when some reader is currently reading
>> that stat file. Not sure if there is a sophisticated solution here.
>>
>> I wonder what others think.
>>
>> Gabor
>>
>> On Tue, Feb 18, 2025 at 9:07 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> I believe the reason stats files allow replacing statistics with the
>>> same snapshot ID is to enable the recomputation of optional stats for
>>> the same snapshot. This process does leave the old stats files
>>> orphaned, but they will be properly cleaned up by the
>>> `remove_orphan_files` action or procedure.
>>>
>>> As stated in the Javadoc of `dropTableData`, the responsibility of
>>> this function is solely to clean up referenced files, not orphaned
>>> ones. Therefore, handling orphaned stats files within this method does
>>> not seem appropriate.
>>>
>>> - Ajantha
>>>
>>> On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Recently, I came across an issue where some Puffin statistics files
>>>> remain in storage after calling *dropTableData()*.
>>>>
>>>> The issue occurs because calling *updateStatistics()* or
>>>> *updatePartitionStatistics()* on a snapshot ID that already has
>>>> existing stat files commits new metadata with the new statistics
>>>> file path, without deleting the old stat files. Since the current
>>>> implementation of *dropTableData()* only removes stat files
>>>> referenced in the latest metadata, older stat files that were
>>>> referenced in previous metadata versions remain in storage.
>>>>
>>>> I drafted a PR that iterates through historical metadata to collect
>>>> all referenced stats files for deletion in dropTableData, but this
>>>> adds significant I/O overhead. Some alternative ideas raised in the
>>>> PR:
>>>>
>>>> - Introduce a flag to trigger iterating over historical metadata
>>>> files to clean up all stat files
>>>> - Delete old stats files when updating/removing stats (risks missing
>>>> files on rollback)
>>>> - Track unreferenced stats files in the latest metadata (increases
>>>> metadata size)
>>>>
>>>> Each approach comes with trade-offs in terms of performance and
>>>> complexity. I hope to get some insights and opinions from the
>>>> community: has this been discussed before, or is this the expected
>>>> behavior? Are there any alternative approaches to handle this more
>>>> efficiently?
>>>>
>>>> References:
>>>>
>>>> - PR: https://github.com/apache/iceberg/pull/12132
>>>> - Issues: https://github.com/apache/iceberg/issues/11876,
>>>> https://github.com/trinodb/trino/issues/16583
>>>> - Code:
>>>> https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>>>>
>>>> Thanks,
>>>> Leon
>>>