During the Iceberg sync this morning, Steve suggested a PR to fix a problem with HadoopFileIO, #15111. I looked into this a bit more and it is based on #14501, which implements a Hadoop scheme where delete may actually move a file to a configured trash directory rather than deleting it. I think that this trash behavior is strange and doesn't fit into FileIO. I think the right thing to do is probably to remove it, but I want to hear what arguments there are for the behavior.
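To make the behavior concrete, here is a toy sketch of trash-on-delete semantics. This is not Iceberg or Hadoop code; the class name and trash path are made up purely for illustration of what a delete that diverts to a trash folder does:

```python
# Toy sketch (not Iceberg code) of trash-on-delete semantics:
# delete() moves the file into a configured trash directory
# instead of actually removing the bytes.
import os
import shutil
import tempfile

class TrashFileIO:
    """Hypothetical FileIO-like object whose delete moves files to trash."""

    def __init__(self, trash_dir):
        self.trash_dir = trash_dir

    def delete_file(self, path):
        # The caller asked for a delete, but the bytes survive in trash_dir.
        os.makedirs(self.trash_dir, exist_ok=True)
        shutil.move(path, os.path.join(self.trash_dir, os.path.basename(path)))

root = tempfile.mkdtemp()
data = os.path.join(root, "data-file.parquet")
with open(data, "w") as f:
    f.write("sensitive rows")

io = TrashFileIO(os.path.join(root, ".Trash"))
io.delete_file(data)

# The "deleted" file is gone from its original location...
print(os.path.exists(data))  # False
# ...but its contents still exist under the trash directory.
print(os.path.exists(os.path.join(root, ".Trash", "data-file.parquet")))  # True
```

The point of the sketch is that after a "delete," the data still exists somewhere the caller did not ask for it to go.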
In my opinion, the trash behavior is confusing and non-obvious for the FileIO interface. The behavior, as I understand it, is to check whether a file should actually be deleted or instead moved to a trash folder. Interestingly, this is not implemented underneath the Hadoop FileSystem interface; it is a client responsibility. Since FileIO is similar to FileSystem, I think there's a strong argument that it isn't appropriate within FileIO either.

There's another argument against this behavior: table changes and user-driven file changes are not the same. Tables can churn files quite a bit, and deletes shouldn't move uncommitted files to trash -- they don't need to be recovered -- nor should they move replaced or deleted data files to a trash folder that could be in a user's home directory. That is a large and non-obvious behavior change, and it conflicts with reasonable governance schemes because it could leak sensitive data.

Next, the use case for a trash folder is recovering from accidental deletes by users. This is unnecessary in Iceberg because tables keep their own history. Accidental data operations are easily rolled back, and history retention is configurable, so there is a window in which to do so. This is also already integrated cleanly, so temporary metadata files that end up not being committed are not kept around.

In the end, I don't think we need this: history is already kept in a better way for tables, and this feature is confusing and doesn't fit the API. What are the use cases for keeping this?

Ryan
