danielcweeks commented on PR #14501: URL: https://github.com/apache/iceberg/pull/14501#issuecomment-3548976534
@jordepic and @ludlows After looking a little more into the way trash works, I don't think this is something we want to turn on at a table level (especially considering how this implementation works).

The Trash feature in Hadoop/HDFS is quite strange: it is a client-, config-, and cluster-level feature, and the three pieces all depend on each other. For example, the client has to respect the config, initialize the Trash, and perform a move operation; otherwise the setting is ignored. The config has to be set and pointed at a location the user has access to. Finally, if you don't apply the configuration to both the client and the NameNode, cleanup won't be performed properly.

Given all of that, this feels very much like an administrator-level feature that needs to be configured (this appears to be the case for Cloudera already, though I don't know if engines like Hive/Impala respect the trash settings). It could be dangerous to allow users to configure this on a per-table basis, because cleanup may not be configured, which may result in data that should be deleted persisting in the file system.

There's also nothing that appears to prevent the configuration from being applied to other file-system implementations (like S3A), which would be bad (data copy, no cleanup), so I feel like we should discourage that. @jordepic Is there anything we can do to prevent this?

I'm not a huge fan of this approach, but it seems like what we have to work with.
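
To make the S3A concern concrete, here is a minimal sketch of one possible guard. It is not taken from the PR. The class `TrashGuard`, the method `shouldUseTrash`, and the scheme allow-list are hypothetical names for illustration, and a plain `Map` stands in for Hadoop's `Configuration` so the example has no dependencies. The only real identifier is the Hadoop key `fs.trash.interval`, where a value of 0 disables trash.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Iceberg or Hadoop.
class TrashGuard {
  // Real Hadoop key: minutes a deleted file is kept in trash; 0 disables trash.
  static final String TRASH_INTERVAL_KEY = "fs.trash.interval";

  // Assumption: only file systems backed by a NameNode get trash cleanup.
  static final Set<String> TRASH_CAPABLE_SCHEMES = Set.of("hdfs");

  // Returns true only when the location is on an allow-listed scheme
  // and the client-side config has a positive trash interval.
  static boolean shouldUseTrash(URI location, Map<String, String> conf) {
    String scheme = location.getScheme();
    if (scheme == null
        || !TRASH_CAPABLE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
      return false; // e.g. s3a: a move would copy data and nothing would clean it up
    }
    String raw = conf.getOrDefault(TRASH_INTERVAL_KEY, "0");
    try {
      return Double.parseDouble(raw.trim()) > 0;
    } catch (NumberFormatException e) {
      return false; // unparseable value: treat trash as disabled
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(TRASH_INTERVAL_KEY, "1440");
    System.out.println(shouldUseTrash(URI.create("hdfs://nn:8020/wh/t/f.parquet"), conf));
    System.out.println(shouldUseTrash(URI.create("s3a://bucket/wh/t/f.parquet"), conf));
  }
}
```

A check like this only covers the client side. It cannot confirm that the NameNode has a matching configuration, so the cleanup risk described above remains an administrator concern.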
