jackye1995 edited a comment on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1046172956
Thanks for raising this thread! I was also planning to raise the same
discussion but had some thoughts to finalize. I have been thinking about this
question quite a lot recently; here are my current thoughts:
## 1. gc.enabled
I think as of today almost all the Iceberg users that tune this parameter
interpret `gc.enabled=false` as "disallow removal of data and delete
files", as @aokolnychyi suggested, so it only makes sense for us to throw an
exception in every place that deletes data of a table when `gc.enabled` is false.
I think there are only 2 cases we need to consider:
1. remove reachable files, which includes:
   1. purge table
   2. expire snapshot
   3. any file deletion based on the Iceberg metadata tree
2. remove orphan files
For 1, we should follow the definition of `gc.enabled` and throw for all
cases.
For 2, see section 3; I think this is not really a table-related operation.
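To make case 1 concrete, here is a minimal sketch in Python of the guard I have in mind. The `gc.enabled` property key matches the Iceberg table property; the helper and function names are my own for illustration, not an existing API:

```python
# Sketch: every operation that deletes files reachable from table metadata
# (purge table, expire snapshots, metadata-tree deletes) checks gc.enabled
# first and throws if it is false. Helper names here are hypothetical.

GC_ENABLED = "gc.enabled"  # Iceberg table property key

def check_gc_enabled(table_properties: dict, operation: str) -> None:
    """Raise if the table disallows removal of data and delete files."""
    # Assumes string-valued properties; gc.enabled defaults to true.
    enabled = table_properties.get(GC_ENABLED, "true").lower() == "true"
    if not enabled:
        raise ValueError(
            f"Cannot {operation}: gc.enabled is false for this table")

def expire_snapshots(table_properties: dict) -> str:
    # Each reachable-file deletion path calls the guard before doing work.
    check_gc_enabled(table_properties, "expire snapshots")
    return "expired"
```

Remove orphan files would deliberately not call this guard, since (per section 3) it is a storage operation rather than a table operation.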
## 2. Remove metadata files by config
The current behavior in `CatalogUtil` of keeping data while removing
metadata feels quite odd to me. If metadata is removed, maybe the remaining data
is still useful and can be reconstructed as a Hive table, but when object storage
mode is enabled it's basically impossible to track down the file locations,
making everything just orphan files. @aokolnychyi you said:
> However, we may consider allowing removal of metadata when gc.enabled is
false. One may argue that metadata files are always owned by the table. We
should also make our action configurable so that it can delete only data or
metadata files.
Could you provide some use cases where this is useful, in addition to
recovery as a Hive table?
## 3. Table location ownership
At first glance, defining a table prefix location is the best way to proceed,
but the more I think about it, the more I realize that defining ownership
of a set of location prefixes in a table is not really needed for the use
cases we want to achieve. Here are the 2 big use cases I have considered so far:
### remove orphan files
The fact that remove orphan files needs a root location definition seems to be
a circular argument. We expose the action `remove_orphan_files(table)` with
the assumption that table files are under the same root prefix, which works for
Hive-like Iceberg tables. But after all, orphan file removal is not a table
operation but a storage operation. Once a file is orphaned, it no longer belongs
to a table. We remove orphan files to save storage cost, not to make any aspect
of the table better. We just remove by table based on an assumption that holds for
Hive tables, and now we try to make the Iceberg spec work with it.
I think the correct way to run remove orphan files is to do it for the
entire warehouse. I talked about this idea a bit in
https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most
storage services can provide a full listing of files, and it can be delivered
much more efficiently than a ListFiles API, e.g. the S3 inventory
list. And Iceberg provides an efficient way to query Iceberg file metadata
through system tables. That means we can perform an efficient distributed join
and find the orphan files of the entire warehouse. I think that's all a
data warehouse admin needs if we provide the feature.
This is basically the `VACUUM` command without referring to a
table. I think that's also what most managed data warehouse products on the
market offer for storage cleanup. `VACUUM table` only makes sense for things
like snapshot expiration and index cleanup, which do not rely on mutual
exclusion of table root locations. If the warehouse spans multiple storage
systems, you just run it for each one. At least we will start by proposing and
providing such a feature for S3; for other storage systems I think it's also
not hard to provide an implementation once the interface is solidified.
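The warehouse-wide approach above boils down to an anti-join between the storage inventory and the union of file paths referenced by every table's metadata tree. A minimal illustrative sketch of that join logic in plain Python (in practice this would be a distributed join, e.g. over an S3 inventory list and Iceberg's `files` system tables; the function and variable names are my own, not an existing API):

```python
# Sketch: warehouse-wide orphan detection as a set anti-join.
# inventory: all file paths reported by the storage (e.g. S3 inventory list).
# referenced_by_tables: per-table sets of file paths reachable from each
# table's metadata tree (e.g. gathered from the `files` system tables).

def find_orphans(inventory, referenced_by_tables):
    """Return files present in storage but referenced by no table."""
    referenced = set()
    for table_files in referenced_by_tables:
        referenced |= set(table_files)
    # Anything in the inventory that no table references is an orphan.
    return set(inventory) - referenced
```

At warehouse scale the same anti-join would be expressed as a distributed query rather than in-memory sets, but the semantics are identical: only files unreferenced by every table are candidates for deletion.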
### table access permission
Another use case of a table root location definition I can think of is
table access control. An admin might configure user access based on the table
locations in storage. However, file-path-based access control
is just a storage-specific implementation, which does not need to be the only
way to support Iceberg table access management. For example, in S3 a writer can
tag all the files it writes, and control access to them using an S3 bucket policy.
This allows multiple Iceberg tables to store files under the same bucket while
access control is still intact.
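As an illustration of the tag-based approach, a bucket policy can grant read access only to objects carrying a given tag, assuming files are tagged at write time. The account ID, role, bucket name, and tag key below are all placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowReadOfOneTablesFiles",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/analyst"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::shared-warehouse-bucket/*",
    "Condition": {
      "StringEquals": {"s3:ExistingObjectTag/iceberg-table": "db.table1"}
    }
  }]
}
```

With a policy like this, many tables can interleave files under one shared bucket prefix, and per-table access boundaries come from tags rather than path prefixes.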
Therefore, from a security perspective, table-root-path-based file ownership
should just be one type of ownership mode.
### The problem of declaring table location ownership
From customer usage reports of S3 object storage mode, most people still
want to store data files in the same shared root bucket to minimize throttling,
as long as the 2 issues I described above can be solved.
If an exclusive root location has to be declared for each table, then files
of the table have to share some part of the S3 prefix, and the throttling
issue comes back: a table not accessed for a year is guaranteed to be in
a cold S3 partition and to get throttled heavily when new traffic arrives.
Maybe S3 can solve this issue in the future (which is unlikely, because that's
how S3 throttling and billing work no matter how the underlying architecture
changes), but my biggest concern is that people will start to build tools that
only work for tables with declared root locations (we already see this in Trino),
which creates a difference in table behavior at the spec level that optimizes
for a certain storage layout.
There are also some areas where I have not finalized my thoughts yet, such as:
when data is replicated, should the replicated locations, quick-access-layer
locations, and archival locations also be declared as owned in Iceberg, and how
should we provide tools to continuously track those aspects? This feels to
me like too many storage implementation details to handle at the table spec
layer.
I think it would be better to delegate these issues to FileIOs, as
that is the actual vendor integration layer provided by Iceberg, where people
choose which storage to use and what fits their bill, and storage frameworks
and products can compete under the same set of rules. At least all the features
I described for S3 are planned to be added, and I think there are GCS and HDFS
equivalents that people can implement if they feel the need to reach feature
parity.
One valuable thing to add to the Iceberg spec is the list (or set?) of all
the table locations used. That could be used by a specific storage to
do whatever is needed based on the information, such as removing all data in
all directories. But in general we should be cautious about saying that all
those locations are owned by a table.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]