jackye1995 edited a comment on pull request #4342: URL: https://github.com/apache/iceberg/pull/4342#issuecomment-1073306515
> It seems like we're trying to overlap use cases by highjacking delete behavior.

Thanks for the feedback Daniel! I don't think this is hijacking the delete behavior. Based on what we've heard from some key customers, tagging an object is now the preferred S3 delete implementation in many companies. Data loss is typically a sev0 incident in a corporation, so the data platform team prefers to gradually retire deleted files to different storage tiers and let S3 handle the lifecycle transition.

Note that this is not a discussion of table deletion / snapshot expiration with purge or not. For platforms that allow people to do table deletion with purge, I would imagine this feature to be preferred, since users could accidentally run a `DROP TABLE`. And even for platforms that do deletion without purge, eventually there is a team that does the "purge" part, and based on the customer discussions we have had, this is typically how they do it: tag the files, then define a lifecycle policy so S3 takes care of the transition of the files (a rough sketch of both pieces, the tag-on-delete call and the matching lifecycle rule, follows at the end of this comment).

> This also requires that the configuration of the bucket aligns with the tags to actually achieve the desired behavior.

This can be explained in 3 parts:

1. Ironically, to satisfy the use case I described, many teams today have to enable S3 object versioning so that data files are not hard-deleted. However, versioning is a huge increase in cost, and users then have to implement custom workflows to deal with the versioned objects. So some sort of bucket configuration is always needed for an enterprise data lake setup, and for those users the tagged approach is going to save cost rather than increase it.
2. There are many features that won't work with the default S3 behavior and need bucket-level settings; some other features we plan to add, like S3 acceleration mode and S3 dual-stack, are all like that. The defaults are mostly kept for backwards compatibility. We see strong adoption of using a lifecycle policy to soft-delete data instead of doing a direct S3 object deletion, and there have recently been threads in similar products like Hudi and Delta about such use cases. I would consider this feature to fall into that category of "new features that users have to opt into" on the S3 side.
3. From the Iceberg perspective, a file is no longer part of the table once it is not part of the Iceberg metadata tree. `FileIO.delete` is always called AFTER a table-level commit that removes the file from the metadata tree (in the full table deletion case it is also done after the catalog operation that removes the table from the catalog), so this semantic stays intact even if the file is not physically deleted for people who would like to use such a feature.

> Overall, it feels like this is a bit of a reach in terms of functionality.

Let me know what you think about this area, and whether you have any better solutions than the proposed one. I think this is a small feature that could enable a lot of valuable use cases. Similar to the access point feature we are adding in S3FileIO, we consider that an intermediate solution that could also be solved at the Iceberg level with a larger spec change to support relative file paths. I think there is value in offering this feature to people who know how to use it, while we think of something better that could solve such retention at the Iceberg spec level.
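To make the mechanism concrete, here is a minimal sketch of what tag-on-delete could look like with the AWS SDK for Java v2. This is an illustration, not the exact code in this PR; the tag key `iceberg-deleted` and the class name are placeholders, since the real tags would come from user configuration:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectTaggingRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Tagging;

public class TagOnDeleteSketch {
  private final S3Client s3;

  public TagOnDeleteSketch(S3Client s3) {
    this.s3 = s3;
  }

  /**
   * Instead of issuing a DeleteObject call, mark the object as deleted with a
   * tag; a bucket lifecycle rule keyed on the same tag then tiers and expires it.
   */
  public void softDelete(String bucket, String key) {
    Tag deleted = Tag.builder()
        .key("iceberg-deleted")  // placeholder; the real tag set would be user-configured
        .value("true")
        .build();

    s3.putObjectTagging(PutObjectTaggingRequest.builder()
        .bucket(bucket)
        .key(key)
        .tagging(Tagging.builder().tagSet(deleted).build())
        .build());
  }
}
```

One caveat any real implementation has to handle: `PutObjectTagging` replaces the object's entire tag set, so existing tags (e.g. write tags) would need to be fetched with `GetObjectTagging` and merged first.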
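On the bucket side, the matching lifecycle rule keyed on that tag might be configured like this. Again a hypothetical sketch: the rule id, tag key, storage class, and retention periods are all placeholders the platform team would choose:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleExpiration;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Transition;
import software.amazon.awssdk.services.s3.model.TransitionStorageClass;

public class LifecycleRuleSketch {
  public static void configure(S3Client s3, String bucket) {
    LifecycleRule rule = LifecycleRule.builder()
        .id("expire-soft-deleted-data")  // placeholder rule id
        .status(ExpirationStatus.ENABLED)
        // Only applies to objects carrying the soft-delete tag.
        .filter(LifecycleRuleFilter.builder()
            .tag(Tag.builder().key("iceberg-deleted").value("true").build())
            .build())
        // Move tagged objects to a colder tier after 30 days...
        .transitions(Transition.builder()
            .days(30)
            .storageClass(TransitionStorageClass.GLACIER)
            .build())
        // ...and hard-delete them after 90 days.
        .expiration(LifecycleExpiration.builder().days(90).build())
        .build();

    s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
        .bucket(bucket)
        .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(rule).build())
        .build());
  }
}
```

With a rule like this in place, `FileIO.delete` only marks the object, and S3 itself performs the tiering and the eventual hard delete on the platform team's schedule.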
