jackye1995 edited a comment on pull request #4342:
URL: https://github.com/apache/iceberg/pull/4342#issuecomment-1073306515


   > It seems like we're trying to overlap use cases by highjacking delete 
behavior.
   
   Thanks for the feedback Daniel! I don't think this is trying to hijack the 
delete behavior. Based on conversations with some key customers, tagging an 
object is now the preferred S3 delete implementation in many companies. Data 
loss is typically a sev0 incident in a corporation, so the data platform team 
prefers to gradually retire deleted files to different storage tiers and let S3 
handle the lifecycle transition.
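   To make the tagging approach concrete, here is a minimal sketch of 
soft-deleting by tagging instead of removing the object. The tag key/value 
(`iceberg-deleted`/`true`) and the bucket/key names are illustrative 
assumptions, not names defined by this PR; the payload shape matches what the 
S3 `PutObjectTagging` API expects.

   ```python
   # Sketch of a soft delete via S3 object tagging. The tag name below is a
   # hypothetical example, not an Iceberg-defined property.
   def delete_tag_set(tags):
       """Build the TagSet payload that S3 PutObjectTagging expects."""
       return {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}

   # With boto3 (requires AWS credentials; shown for illustration only):
   # import boto3
   # s3 = boto3.client("s3")
   # s3.put_object_tagging(
   #     Bucket="my-bucket",
   #     Key="data/file.parquet",
   #     Tagging=delete_tag_set({"iceberg-deleted": "true"}),
   # )
   ```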
   
   Note that this is not a discussion of table deletion/snapshot expiration 
with or without purge. For platforms that allow table deletion with purge, I 
would imagine this feature to be preferred, to protect against a user 
accidentally running a DROP TABLE. And even for platforms that delete without 
purge, eventually some team has to do the "purge" part, and based on the 
customer discussions we have had, they typically do it by tagging files and 
then defining a lifecycle policy so that S3 takes care of transitioning the 
files.
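   As an illustration of the lifecycle half of that workflow, here is a sketch 
of a lifecycle rule that transitions tagged objects to Glacier and later 
expires them. The tag name, day counts, and rule ID are illustrative 
assumptions; the dict shape matches the S3 bucket lifecycle configuration 
structure.

   ```python
   # Hypothetical lifecycle rule: objects carrying the delete tag move to
   # Glacier after 30 days and expire after 365 days. Tag key/value and day
   # counts are examples, not Iceberg-defined settings.
   def soft_delete_lifecycle_rule(tag_key="iceberg-deleted", tag_value="true"):
       return {
           "ID": "retire-soft-deleted-files",
           "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
           "Status": "Enabled",
           "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
           "Expiration": {"Days": 365},
       }

   # Applied with boto3 (illustration only):
   # s3.put_bucket_lifecycle_configuration(
   #     Bucket="my-bucket",
   #     LifecycleConfiguration={"Rules": [soft_delete_lifecycle_rule()]},
   # )
   ```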
   
   > This also requires that the configuration of the bucket aligns with the 
tags to actually achieve the desired behavior.
   
   This can be explained in 3 parts:
   
   1. Ironically, to satisfy the use case I described, many teams today have to 
enable S3 object versioning so that data files are not hard-deleted. However, 
that is a huge increase in cost, and users then have to implement custom 
workflows to deal with versioned objects. So some sort of bucket configuration 
is always needed for an enterprise data lake setup, and this tagging approach 
will save cost rather than increase cost for those users.
   
   2. There are many features that won't work with the default S3 behavior and 
need bucket-level settings; some other features we plan to add, like S3 
acceleration mode and S3 dual-stack, are all like that. This is mostly done for 
backwards compatibility. We see strong adoption of using a lifecycle policy to 
soft-delete data instead of doing a direct S3 object deletion, and there have 
recently been threads in similar products like Hudi and Delta about such use 
cases. I would consider this feature to fall into the category of "new features 
that users have to opt in to" on the S3 side.
   
   3. From the Iceberg perspective, a file is no longer part of the table if it 
is not part of the Iceberg metadata tree. FileIO.delete is always used AFTER a 
table-level commit that removes the file from the metadata tree (in the full 
table deletion case it also happens after the catalog operation that removes 
the table from the catalog), so this semantic remains intact even if a file is 
not physically deleted, for people who would like to use such a feature.
   
   > Overall, it feels like this is a bit of a reach in terms of functionality.
   
   Let me know what you think about this topic, and whether you have any better 
solutions than the proposed one. I think this is a small feature that could 
enable a lot of valuable use cases. Similar to the access point feature we are 
adding in S3FileIO, we consider this an intermediate solution that could also 
be solved at the Iceberg level with a larger spec change to support relative 
file paths. I think there is value in offering this feature to people who know 
how to use it, while we think about something better that could solve such 
retention at the Iceberg spec level.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


