jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804879622
##########
File path:
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
@Override
public boolean dropTable(Identifier ident) {
+ return dropTableInternal(ident, false);
+ }
+
+ @Override
+ public boolean purgeTable(Identifier ident) {
+ return dropTableInternal(ident, true);
+ }
+
+ private boolean dropTableInternal(Identifier ident, boolean purge) {
try {
return isPathIdentifier(ident) ?
- tables.dropTable(((PathIdentifier) ident).location()) :
- icebergCatalog.dropTable(buildIdentifier(ident));
+ tables.dropTable(((PathIdentifier) ident).location(), purge) :
Review comment:
My understanding is that `gc.enabled` should only affect the snapshot
expiration (including remove orphan files) aspect; at least that is what my
definition of "garbage collection" covers. The DROP TABLE behavior of Iceberg
external vs. managed tables is a separate discussion. We should not try to work
around the managed table issue using `gc.enabled`.
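To make the scoping argument concrete, here is a minimal sketch of `gc.enabled` gating only data-removal maintenance (snapshot expiration), not DROP TABLE. Table properties are modeled as a plain map for self-containment; the property name `gc.enabled` with default `true` matches Iceberg's `TableProperties.GC_ENABLED`, but the `GcEnabledCheck` class and its methods are hypothetical illustrations, not Iceberg APIs.

```java
import java.util.Map;

// Hypothetical sketch: gc.enabled should gate only file-removing maintenance
// such as snapshot expiration and orphan-file removal, per the argument above.
public class GcEnabledCheck {

  // Iceberg's real property is "gc.enabled", defaulting to true.
  static boolean gcEnabled(Map<String, String> props) {
    return Boolean.parseBoolean(props.getOrDefault("gc.enabled", "true"));
  }

  // Maintenance that deletes data/metadata files checks the flag first.
  static void expireSnapshots(Map<String, String> props) {
    if (!gcEnabled(props)) {
      throw new IllegalStateException(
          "Cannot expire snapshots: gc.enabled is false for this table");
    }
    // ... locate and delete files of expired snapshots here
  }

  public static void main(String[] args) {
    System.out.println(gcEnabled(Map.of()));                      // defaults to true
    System.out.println(gcEnabled(Map.of("gc.enabled", "false"))); // explicitly disabled
  }
}
```

Under this reading, a DROP TABLE (with or without PURGE) would not consult `gc.enabled` at all; only expiration-style maintenance would.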
The industry has been using this notion of managed vs. external tables for a
long time now. Most people simply associate it with the drop table behavior,
but it really describes the ownership of a table by an engine with a data
catalog. Take Hive as an example: a managed table (Hive ACID) means the table
is in the Hive metastore and its data is fully managed by the Hive data
warehouse. Changing data through external systems will likely cause trouble, so
those systems are supposed to only read the data, which means treating the
table as external. This is how most vertically integrated products define their
table concepts, including Snowflake and AWS Redshift.
In the Iceberg multi-engine model, where data can be updated by any compute
engine or process through the Iceberg library, this definition of managed,
which bundles catalog and engine, is confusing. But we can still make it
consistent by changing the definition: for Iceberg, the ownership of the table
is bundled with the catalog service plus the Iceberg spec itself. Any compute
engine can claim responsibility for managing a table, as long as the Iceberg
spec is respected and the catalog service for the table is identified. Under
this definition, the issue you describe is real: if two table identifiers point
to the same underlying table metadata in storage, the processes operating
against those tables are both managing the underlying files. I think this
should be made extra clear for users who would like to use the `registerTable`
feature we just introduced: the registered table is being fully managed, not
just an external table.
I think an extension of `registerTable` is needed for Iceberg to complete
the notion of an external table, and it should be the catalog's responsibility
to manage these semantics: perhaps set a flag to distinguish external tables
from normal tables, making it clear that an external table is a read-only copy
of a managed table. The `registerTable` interface should be updated to
something like:
```
default Table registerTable(TableIdentifier identifier, String metadataFileLocation, boolean isManaged) {
  throw new UnsupportedOperationException("Registering tables is not supported");
}
```
If the table is not managed, some sort of flag should be set at the catalog
level, so that if a user later tries to DROP TABLE ... PURGE, the catalog
rejects the operation. I think this would fully solve the issue you describe
about "a user creating a Snapshot of another table, then dropping it with
purge".
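As a rough sketch of what that catalog-level rejection could look like: the catalog remembers whether each registered table is managed, and refuses a purging drop for external (read-only) registrations. The class name, the in-memory maps, and the string identifiers are all hypothetical simplifications, not the actual Iceberg `Catalog` API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory catalog illustrating the proposed semantics:
// registerTable records an isManaged flag, and dropTable(purge=true) is
// rejected for tables registered as external (read-only copies).
public class ExternalTableCatalog {
  private final Map<String, String> metadataLocations = new HashMap<>();
  private final Map<String, Boolean> managed = new HashMap<>();

  public void registerTable(String identifier, String metadataFileLocation, boolean isManaged) {
    metadataLocations.put(identifier, metadataFileLocation);
    managed.put(identifier, isManaged);
  }

  public boolean dropTable(String identifier, boolean purge) {
    if (purge && !managed.getOrDefault(identifier, true)) {
      // An external registration does not own the files, so purging is refused.
      throw new IllegalStateException(
          "Cannot purge external table " + identifier
              + ": it is a read-only copy of a managed table");
    }
    // A plain drop only removes the catalog entry; a purge of a managed
    // table would additionally delete data and metadata files here.
    managed.remove(identifier);
    return metadataLocations.remove(identifier) != null;
  }

  public static void main(String[] args) {
    ExternalTableCatalog catalog = new ExternalTableCatalog();
    catalog.registerTable("db.snap", "s3://bucket/metadata/v1.json", false);
    try {
      catalog.dropTable("db.snap", true);
      System.out.println("purge allowed");
    } catch (IllegalStateException e) {
      System.out.println("purge rejected");
    }
    System.out.println(catalog.dropTable("db.snap", false)); // plain drop still works
  }
}
```

This keeps the snapshot-of-another-table scenario safe: the snapshot copy can be dropped from the catalog, but never purged out from under the owning table.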
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.