During the Iceberg sync this morning, Steve suggested a PR to fix a problem with HadoopFileIO, #15111. I looked into this a bit more and it is based on #14501, which implements a Hadoop scheme where delete may actually move a file to a configured trash directory rather than deleting it. I think that this trash behavior is strange and doesn't fit into FileIO. I think the right thing to do is probably to remove it, but I want to hear what arguments there are for the behavior.
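To make the behavior concrete, here is a toy sketch of trash-on-delete semantics. This is not Iceberg or Hadoop code; the class name and trash path are made up purely for illustration of what a delete that diverts to a trash folder does:

```python
# Toy sketch (not Iceberg code) of trash-on-delete semantics:
# delete() moves the file into a configured trash directory
# instead of actually removing the bytes.
import os
import shutil
import tempfile

class TrashFileIO:
    """Hypothetical FileIO-like object whose delete moves files to trash."""

    def __init__(self, trash_dir):
        self.trash_dir = trash_dir

    def delete_file(self, path):
        # The caller asked for a delete, but the bytes survive in trash_dir.
        os.makedirs(self.trash_dir, exist_ok=True)
        shutil.move(path, os.path.join(self.trash_dir, os.path.basename(path)))

root = tempfile.mkdtemp()
data = os.path.join(root, "data-file.parquet")
with open(data, "w") as f:
    f.write("sensitive rows")

io = TrashFileIO(os.path.join(root, ".Trash"))
io.delete_file(data)

# The "deleted" file is gone from its original location...
print(os.path.exists(data))  # False
# ...but its contents still exist under the trash directory.
print(os.path.exists(os.path.join(root, ".Trash", "data-file.parquet")))  # True
```

The point of the sketch is that after a "delete," the data still exists somewhere the caller did not ask for it to go.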
In my opinion, the trash behavior is confusing and non-obvious for the FileIO interface. The behavior, as I understand it, is to check whether a file should actually be deleted or instead moved to a trash folder. Interestingly, this is not implemented underneath the Hadoop FileSystem interface; it is a client responsibility. Since FileIO is similar to FileSystem, I think there's a strong argument that it isn't appropriate within FileIO either.

There's another argument against this behavior: table changes and user-driven file changes are not the same. Tables can churn files quite a bit, and deletes shouldn't move uncommitted files to trash -- they don't need to be recovered -- nor should they move replaced or deleted data files to a trash folder that could be in a user's home directory. That is a large and non-obvious behavior change, and it conflicts with reasonable governance schemes because it could leak sensitive data.

Next, the use case for a trash folder is recovering from accidental deletes by users. This is unnecessary in Iceberg because tables keep their own history. Accidental data operations are easily rolled back, and history retention is configurable, so there is a window in which to do so. This is also already integrated cleanly, so temporary metadata files that end up not being committed are not kept around.

In the end, I don't think we need this: history is already kept in a better way for tables, and this feature is confusing and doesn't fit the API. What are the use cases for keeping this?

Ryan
