Amazon classic S3 has versioning. One thing which could be considered would be to take the manifest of a specific version of a table, enumerate the S3 version of it and of every file it references, and so produce a list of the version IDs of all the files which that version of the table referenced.
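To make that concrete, here's a minimal sketch of the version-resolution step, in plain Python over already-fetched data. The names `version_as_of` and `versions_for_table` are made up for illustration; in practice the per-key lists of (version ID, last-modified, is-delete-marker) tuples would come from something like boto3's `list_object_versions` paginator (`Versions` plus `DeleteMarkers`, newest first per key). A sketch under those assumptions, not a tested tool:

```python
from datetime import datetime, timezone

def version_as_of(versions, snapshot_time):
    """Pick the S3 version ID of one object that was current at snapshot_time.

    `versions` is a list of (version_id, last_modified, is_delete_marker)
    tuples for a single key, newest first.  Returns None if the key did not
    exist at that time: either never written yet, or hidden by a delete marker.
    """
    for version_id, last_modified, is_delete_marker in versions:
        if last_modified <= snapshot_time:
            # Newest version at or before the snapshot; a delete marker here
            # means the key counted as deleted at that point.
            return None if is_delete_marker else version_id
    return None  # every version is newer than the snapshot

def versions_for_table(listing, referenced_keys, snapshot_time):
    """Map every file a table version references to its S3 version ID."""
    return {
        key: version_as_of(listing.get(key, []), snapshot_time)
        for key in referenced_keys
    }
```

Any key this maps to None is exactly the problem case for the "just restore the old versions in place" shortcut: the name was recycled or deleted after the snapshot, so the table version cannot be recovered by restoring current names alone.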
Given that information, the next bit of work is to regenerate a (different) table from that data. You would not want to restore each object (that would break anything referring to a later version), but to read that underlying version and write it elsewhere. Actually, if no filename was ever recycled, the old versions could simply be restored and a recovered table set up with the old files as-is; that would be a lot simpler. Is that 100% true, always?

On Mon, 23 Feb 2026 at 22:18, Ryan Blue <[email protected]> wrote:

> I merged the PR to revert this since I don't think anyone is strongly for
> keeping it. I also think Steve is right that if we have NN pressure we
> would want to use a bulk endpoint, and that it won't be better to use
> renames.
>
> The original author also confirmed on the PR that they can use a custom
> FileIO for this and don't need it to be in Iceberg. That use case was
> around having a way to undo bad orphan file cleanup, which would delete
> files underneath the table. I don't think that's really an Iceberg
> responsibility, again because if it were, it would be built into the Hadoop
> FileSystem rather than the FileIO layer above.
>
> It would also be a good idea to think about how to address those cases
> differently. I think replicas are a good way to do that, and they are
> going to be easier to produce in v4 (with relative paths), but other
> ideas are definitely welcome here!
>
> On Mon, Feb 23, 2026 at 3:07 AM Steve Loughran <[email protected]> wrote:
>
>> On Sat, 21 Feb 2026 at 09:02, Cheng Pan <[email protected]> wrote:
>>
>>> To share a use case of HDFS Trash: deleting a directory on HDFS that has
>>> tons of files might put significant pressure on the NameNode and
>>> slow the HDFS cluster for dozens of minutes, while moving to Trash is
>>> relatively cheap. Those files can then be deleted in the background,
>>> in small batches, once they reach their expiration time, with no
>>> pressure or latency on the NameNode.
>>>
>> Iceberg is only deleting files though, not directories; it'll be
>> acquiring a lock per file for a delete, and for a rename it needs to get
>> a lock on ~/.Trash too. I don't see it being any worse here.
>>
>> Now, if you were to add bulk delete support to HDFS, we could send a
>> single RPC with a batch of files, and HDFS could go through them and
>> delete each in turn, failing if a directory was encountered. And like
>> the s3a implementation, it could be throttled: you'd implement that on
>> the server before actually acquiring any locks, so all callers of bulk
>> delete would be constrained.
>>
>>> If possible, I would still like Iceberg to have this feature.
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>> On Fri, Feb 20, 2026 at 3:22 AM Daniel Weeks <[email protected]> wrote:
>>> >
>>> > I agree with Steve and Ryan on this.
>>> >
>>> > I was a bit critical of all the issues with configuration and behavior
>>> > when reviewing the PR, but felt that containing it to HDFS might make
>>> > it reasonable to close the gap in behavior between Hive tables and
>>> > Iceberg.
>>> >
>>> > However, it is complicated, messy, and could cause surprising behavior
>>> > for anyone who has it turned on in their environment when it suddenly
>>> > starts being respected, causing lots of trash activity.
>>> >
>>> > I'll open a PR to revert and reach out to the original author.
>>> >
>>> > -Dan
>>> >
>>> > On Thu, Feb 19, 2026 at 11:14 AM Steve Loughran <[email protected]> wrote:
>>> >>
>>> >> I'm very happy with removing support; it just complicates the code
>>> >> for a failure condition, "accidental deletion", which shouldn't
>>> >> surface.
>>> >>
>>> >> The only time users may want to roll back a delete is DROP TABLE,
>>> >> and there it's the job of the catalog to give users a way to revert
>>> >> it.
>>> >>
>>> >> It's not shipped yet, so removal is not a regression at all.
>>> >>
>>> >> steve
>>> >>
>>> >> On Wed, 18 Feb 2026 at 22:48, Ryan Blue <[email protected]> wrote:
>>> >>>
>>> >>> During the Iceberg sync this morning, Steve suggested a PR to fix a
>>> >>> problem with HadoopFileIO, #15111. I looked into this a bit more and
>>> >>> it is based on #14501, which implements a Hadoop scheme where delete
>>> >>> may actually move a file to a configured trash directory rather than
>>> >>> deleting it. I think that this trash behavior is strange and doesn't
>>> >>> fit into FileIO. I think the right thing to do is probably to remove
>>> >>> it, but I want to see what arguments there are for the behavior.
>>> >>>
>>> >>> In my opinion, the trash behavior is confusing and not obvious for
>>> >>> the FileIO interface. The behavior, as I understand it, is to check
>>> >>> whether a file should actually be deleted or should just be moved to
>>> >>> a trash folder. Interestingly, this is not done underneath the Hadoop
>>> >>> FileSystem interface, but is a client responsibility. Since FileIO is
>>> >>> similar to FileSystem, I think there's a strong argument that it
>>> >>> isn't appropriate within FileIO either. But there's another argument
>>> >>> for not having this behavior, which is that table changes and
>>> >>> user-driven file changes are not the same. Tables can churn files
>>> >>> quite a bit, and deletes shouldn't move uncommitted files to trash --
>>> >>> they don't need to be recovered -- nor should they move replaced or
>>> >>> deleted data files to a trash folder that could be in a user's home
>>> >>> directory -- a big and non-obvious behavior change. This also seems
>>> >>> to conflict with reasonable governance schemes, because it could leak
>>> >>> sensitive data.
>>> >>>
>>> >>> Next, the use case for a trash folder is recovering from accidental
>>> >>> deletes by users. This is unnecessary in Iceberg because tables keep
>>> >>> their own history. Accidental data operations are easily rolled back,
>>> >>> and we have a configurable history window in which to do it.
>>> >>> This is also already integrated cleanly, so that temporary metadata
>>> >>> files which end up not being committed are not held.
>>> >>>
>>> >>> In the end, I think we don't need this, because history is already
>>> >>> kept in a better way for tables, and this feature is confusing and
>>> >>> doesn't fit in the API. What are the use cases for keeping this?
>>> >>>
>>> >>> Ryan
