jackye1995 edited a comment on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1046172956
Thanks for raising this thread! I was also planning to raise the same
discussion but had some thoughts to finalize. I have been thinking about this
question quite a lot recently; here are my current thoughts:
## 1. gc.enabled
I think as of today almost all the Iceberg users that tune this parameter
interpret `gc.enabled=false` as "disallow removal of data and delete
files", as @aokolnychyi suggested, so it only makes sense for us to throw an
exception in every place that deletes data of a table when `gc.enabled` is false.
I think there are only 2 cases we need to consider:
1. remove reachable files, which includes:
   1. purge table
   2. expire snapshot
   3. any file deletion based on the Iceberg metadata tree
2. remove orphan files
For 1, we should follow the definition of `gc.enabled` and throw for all
cases.
For 2, see section 3; I think this is not really a table-related operation.
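To make case 1 concrete, here is a minimal sketch in Python of the guard I have in mind. The `gc.enabled` property key matches the Iceberg table property; the helper and function names are my own for illustration, not an existing API:

```python
# Sketch: every operation that deletes files reachable from table metadata
# (purge table, expire snapshots, metadata-tree deletes) checks gc.enabled
# first and throws if it is false. Helper names here are hypothetical.

GC_ENABLED = "gc.enabled"  # Iceberg table property key

def check_gc_enabled(table_properties: dict, operation: str) -> None:
    """Raise if the table disallows removal of data and delete files."""
    # Assumes string-valued properties; gc.enabled defaults to true.
    enabled = table_properties.get(GC_ENABLED, "true").lower() == "true"
    if not enabled:
        raise ValueError(
            f"Cannot {operation}: gc.enabled is false for this table")

def expire_snapshots(table_properties: dict) -> str:
    # Each reachable-file deletion path calls the guard before doing work.
    check_gc_enabled(table_properties, "expire snapshots")
    return "expired"
```

Remove orphan files would deliberately not call this guard, since (per section 3) it is a storage operation rather than a table operation.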
## 2. Remove metadata files by config
The current behavior in `CatalogUtil` of keeping data while removing
metadata feels quite odd to me. If metadata is removed, maybe the remaining data
is still useful and can be reconstructed as a Hive table, but when object storage
mode is enabled it's basically impossible to track down the file locations,
making everything just orphan files. @aokolnychyi you said:
> However, we may consider allowing removal of metadata when gc.enabled is
false. One may argue that metadata files are always owned by the table. We
should also make our action configurable so that it can delete only data or
metadata files.
Could you provide some use cases where this is useful, in addition to
recovery as a Hive table?
## 3. Table location ownership
At first glance, defining a table prefix location is the best way to proceed,
but the more I think about it, the more I realize that defining ownership
of a set of location prefixes in a table is not really needed for the use
cases we want to achieve. Here are the 2 big use cases I have considered so far:
### remove orphan files
The fact that remove orphan files needs a root location definition seems to be
a circular argument. We expose the action `remove_orphan_files(table)` with
the assumption that table files are under the same root prefix, which works for
Hive-like Iceberg tables. But after all, orphan file removal is not a table
operation but a storage operation. Once a file is orphaned, it no longer belongs
to a table. We remove orphan files to save storage cost, not to make any aspect
of the table better. We just remove by table based on an assumption that holds for
Hive tables, and now we try to make the Iceberg spec work with it.
I think the correct way to run remove orphan files is to do it for the
entire warehouse. I talked about this idea a bit in
https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most
storage services can provide a full listing of files, and it can be delivered
much more efficiently than a ListFiles API, e.g. the S3 inventory
list. And Iceberg provides an efficient way to query Iceberg file metadata
through system tables. That means we can perform an efficient distributed join
and find the orphan files of the entire warehouse. I think that's all a
data warehouse admin needs if we provide the feature.
This is basically the `VACUUM` command without referring to a
table. I think that's also what most managed data warehouse products on the
market offer for storage cleanup. `VACUUM table` only makes sense for things
like snapshot expiration and index cleanup, which do not rely on mutual
exclusion of table root locations. If the warehouse spans multiple storage
systems, you just run it for each one. At least we will start by proposing and
providing such a feature for S3; for other storage systems I think it's also
not hard to provide an implementation once the interface is solidified.
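The warehouse-wide approach above boils down to an anti-join between the storage inventory and the union of file paths referenced by every table's metadata tree. A minimal illustrative sketch of that join logic in plain Python (in practice this would be a distributed join, e.g. over an S3 inventory list and Iceberg's `files` system tables; the function and variable names are my own, not an existing API):

```python
# Sketch: warehouse-wide orphan detection as a set anti-join.
# inventory: all file paths reported by the storage (e.g. S3 inventory list).
# referenced_by_tables: per-table sets of file paths reachable from each
# table's metadata tree (e.g. gathered from the `files` system tables).

def find_orphans(inventory, referenced_by_tables):
    """Return files present in storage but referenced by no table."""
    referenced = set()
    for table_files in referenced_by_tables:
        referenced |= set(table_files)
    # Anything in the inventory that no table references is an orphan.
    return set(inventory) - referenced
```

At warehouse scale the same anti-join would be expressed as a distributed query rather than in-memory sets, but the semantics are identical: only files unreferenced by every table are candidates for deletion.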
### table access permission
Another use case of a table root location definition I can think of is
table access control. An admin might configure user access based on the table
locations in storage. However, file-path-based access control
is just a storage-specific implementation, which does not need to be the only
way to support Iceberg table access management. For example, in S3 a writer can
tag all the files it writes, and control access to them using an S3 bucket policy.
This allows multiple Iceberg tables to store files under the same bucket while
access control is still intact.
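As an illustration of the tag-based approach, a bucket policy can grant read access only to objects carrying a given tag, assuming files are tagged at write time. The account ID, role, bucket name, and tag key below are all placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowReadOfOneTablesFiles",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/analyst"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::shared-warehouse-bucket/*",
    "Condition": {
      "StringEquals": {"s3:ExistingObjectTag/iceberg-table": "db.table1"}
    }
  }]
}
```

With a policy like this, many tables can interleave files under one shared bucket prefix, and per-table access boundaries come from tags rather than path prefixes.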
Therefore, from a security perspective, table-root-path-based file ownership
should just be one type of ownership mode.
### The problem of declaring table location ownership
From customer usage reports of S3 object storage mode, most people still
want to store data files in the same shared root bucket to minimize throttling,
as long as the 2 issues I described above can be solved.
If an exclusive root location has to be declared for each table, then files
of the table have to share some part of the S3 prefix, and the throttling
issue comes back: a table not accessed for a year is guaranteed to be in
a cold S3 partition and to get throttled heavily when new traffic arrives.
Maybe S3 can solve this issue in the future (which is unlikely, because that's
how S3 throttling and billing work no matter how the underlying architecture
changes), but my biggest concern is that people will start to build tools that
only work for tables with declared root locations (we already see this in Trino),
which creates a difference in table behavior at the spec level that optimizes
for a certain storage layout.
There are also some areas where I have not finalized my thoughts yet, such as:
when data is replicated, should the replicated locations, quick-access-layer
locations, and archival locations also be declared as owned in Iceberg, and how
should we provide tools to continuously track those aspects? This feels to
me like too many storage implementation details to handle at the table spec
layer.
I think it would be better to delegate these issues to FileIOs, as
that is the actual vendor integration layer provided by Iceberg, where people
choose which storage to use and what fits their bill, and storage frameworks
and products can compete under the same set of rules. At least all the features
I described for S3 are planned to be added, and I think there are GCS and HDFS
equivalents that people can implement if they feel the need to reach feature
parity.
One valuable thing to add to the Iceberg spec is the list (or set?) of all
the table locations used. That could be used by a specific storage to
do whatever is needed based on the information, such as removing all data in
all directories. But in general we should be cautious about saying that all
those locations are owned by a table.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]