rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805069054
##########
File path:
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident,
TableChange... changes) throws No
@Override
public boolean dropTable(Identifier ident) {
+ return dropTableInternal(ident, false);
+ }
+
+ @Override
+ public boolean purgeTable(Identifier ident) {
+ return dropTableInternal(ident, true);
+ }
+
+ private boolean dropTableInternal(Identifier ident, boolean purge) {
try {
return isPathIdentifier(ident) ?
- tables.dropTable(((PathIdentifier) ident).location()) :
- icebergCatalog.dropTable(buildIdentifier(ident));
+ tables.dropTable(((PathIdentifier) ident).location(), purge) :
Review comment:
@jackye1995's summary of external tables is helpful. Thanks for helping set the
context!
The confusion here usually comes from users who have been taught to expect
certain outcomes from external tables, such as that external tables don't
remove data when they're dropped. Those guarantees are inconsistent in the
first place: you can still overwrite (and thereby remove files) in an external
table in Hive. Besides, why should a user have to decide, at table creation
time, whether a table's data can be deleted when it is eventually dropped?
How to clean up when dropping a table depends on choices made by the data
platform. At Netflix, we removed references only and deleted data files only
after a waiting period (to allow restoring a table). Users had no say in how
the data was removed, only that the table was logically dropped. Both
`EXTERNAL` and `PURGE` make little sense with policies like that.
> I think an extension of registerTable is needed for Iceberg to complete
the notion of external table. And it should be the catalog's responsibility to
manage this semantics, maybe set a flag to distinguish external table from
normal tables to make it clear that it's a read-only copy of a managed table.
I really don't think that the concept of `EXTERNAL` has a place in Iceberg
and I am very skeptical that we should add it. `EXTERNAL` essentially means
that the table is owned by some other system. But any engine can use an Iceberg
catalog and the library to update a table. What is the utility of arbitrarily
restricting that?
As Jack said, we can think of the relationship differently, removing the
engine requirement. But if we still have `EXTERNAL` in this world, then it is
really just to support secondary references. I don't know why you would want
secondary references that look exactly like primary references but are
restricted from making changes. Just connect to the catalog. And if there's a
need, we can add a catalog flag that makes it read-only.
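If it helps to make the idea concrete, a read-only catalog flag could be a simple wrapper that passes reads through and rejects writes. This is a hypothetical sketch: `MiniCatalog` and `ReadOnlyCatalog` are invented for illustration and are not Iceberg's actual `Catalog` API.

```java
// Hypothetical sketch of a read-only catalog flag. MiniCatalog and
// ReadOnlyCatalog are invented names, not Iceberg's Catalog API.
interface MiniCatalog {
  String loadTable(String name);   // read operation
  boolean dropTable(String name);  // write operation
}

class ReadOnlyCatalog implements MiniCatalog {
  private final MiniCatalog delegate;

  ReadOnlyCatalog(MiniCatalog delegate) {
    this.delegate = delegate;
  }

  @Override
  public String loadTable(String name) {
    // reads pass through to the underlying catalog
    return delegate.loadTable(name);
  }

  @Override
  public boolean dropTable(String name) {
    // writes are rejected: the secondary reference cannot make changes
    throw new UnsupportedOperationException(
        "Catalog is read-only: cannot drop " + name);
  }
}
```

The point is that the restriction lives in the catalog configuration, not in a per-table `EXTERNAL` flag chosen at create time.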
That still leaves an open question of what to do for `gc.enabled`. There are
times when a table doesn't "own" the resources within it and `gc.enabled` is a
quick way to account for that -- to basically turn off operations that may
delete files that aren't owned. I think that includes PURGE, but I do see the
distinction and agree that it is an ugly choice to need to make. Short term, I
think we should disable PURGE and throw an exception if `gc.enabled` is false.
Longer term, I think we need to refine the idea of file ownership. The best
way that I can think of is to do this by prefix, as we discussed on a thread in
the Trino community a while back and in the Iceberg sync. In this model, a
table gets its own prefix and can delete anything in that prefix (or more
likely, a small set of prefixes). Imported data files are unowned and are not
deleted by expire snapshots or purge if they are outside of owned prefixes.
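The prefix-ownership model could be sketched roughly like this, assuming a table carries a small list of owned location prefixes and maintenance operations filter deletions against it. All names here are invented for illustration; this is not an existing Iceberg API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of prefix-based file ownership: a table owns a small set
// of location prefixes, and expire-snapshots or purge may delete only files
// under an owned prefix. Imported files outside those prefixes are unowned
// and left alone.
class OwnedPrefixes {
  private final List<String> prefixes;

  OwnedPrefixes(List<String> prefixes) {
    this.prefixes = prefixes;
  }

  // true if table maintenance is allowed to delete this file
  boolean owns(String filePath) {
    return prefixes.stream().anyMatch(filePath::startsWith);
  }

  // keep only the candidate files this table is allowed to delete
  List<String> deletableFiles(List<String> candidates) {
    return candidates.stream().filter(this::owns).collect(Collectors.toList());
  }
}
```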
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]