jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804879622



##########
File path: 
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
      My understanding is that `gc.enabled` should only affect the snapshot expiration aspect (including remove orphan files); at least, that is what my definition of “garbage collection” is. The DROP TABLE behavior of Iceberg external vs. managed tables is a different discussion, and we should not try to work around the managed table issue using `gc.enabled`.
   
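   To make that scope concrete, here is a minimal sketch of where I think the `gc.enabled` check belongs: in the file-removing maintenance paths (expire snapshots, remove orphan files), not in the DROP TABLE code path. The helper class here is hypothetical; `TableProperties.GC_ENABLED` and `PropertyUtil` are existing Iceberg utilities.
   
   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.TableProperties;
   import org.apache.iceberg.exceptions.ValidationException;
   import org.apache.iceberg.util.PropertyUtil;
   
   // Hypothetical guard used only by file-removing maintenance actions.
   class GcEnabledGuard {
     static void checkGcEnabled(Table table) {
       boolean gcEnabled = PropertyUtil.propertyAsBoolean(
           table.properties(), TableProperties.GC_ENABLED, TableProperties.GC_ENABLED_DEFAULT);
       ValidationException.check(gcEnabled,
           "Cannot remove files: GC is disabled for table %s", table.name());
     }
   }
   ```
   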
   The industry has been using this notion of Managed vs. External tables for a long time now. Most people simply associate it with delete-table behavior, but it really describes ownership of a table by an engine bundled with a data catalog. Take Hive as an example: a Managed table (Hive ACID) means the table lives in the Hive metastore and its data is fully managed by the Hive data warehouse. Changing the data through external systems will likely cause trouble, so external systems are expected to only read the data, which means treating the table as External. This is how most vertically integrated products define their table concepts, including Snowflake and AWS Redshift.
   
   In the Iceberg multi-engine model, where data can be updated by any compute engine or process through the Iceberg library, this definition of managed that bundles catalog and engine is confusing. But we can still make it consistent by adjusting the definition: for Iceberg, ownership of the table is bundled with the catalog service plus the Iceberg spec itself. Any compute engine can claim responsibility for managing a table, as long as the Iceberg spec is respected and the catalog service for the table is identified. Under this definition, the issue you describe holds: if two table identifiers point to the same underlying table metadata in storage, then the processes operating against those identifiers are both managing the underlying files. I think this should be made extra clear for users of the `registerTable` feature we just introduced: the registered table is fully managed, not just an External table.
   
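   To illustrate that sharp edge, here is a hedged sketch using the existing `Catalog.registerTable(TableIdentifier, String)` and `loadTable` APIs; the helper class and identifiers are made up for the example.
   
   ```java
   import org.apache.iceberg.BaseTable;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.Catalog;
   import org.apache.iceberg.catalog.TableIdentifier;
   
   // Illustrative only: after registerTable, two identifiers share the same
   // metadata and data files, so both sides are effectively managing them.
   class RegisterTableSharpEdge {
     static Table registerCopy(Catalog catalog, TableIdentifier source, TableIdentifier copy) {
       Table original = catalog.loadTable(source);
       String metadataLocation =
           ((BaseTable) original).operations().current().metadataFileLocation();
       // A purge through either `source` or `copy` would delete files the other still uses.
       return catalog.registerTable(copy, metadataLocation);
     }
   }
   ```
   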
   I think an extension of `registerTable` is needed for Iceberg to complete the notion of an external table. It should be the catalog's responsibility to manage these semantics, perhaps by setting a flag that distinguishes external tables from normal tables and makes it clear that an external table is a read-only copy of a managed table. The `registerTable` interface could be updated to something like:
   
   ```java
   default Table registerTable(TableIdentifier identifier, String metadataFileLocation, boolean isManaged) {
     throw new UnsupportedOperationException("Registering tables is not supported");
   }
   ```
   
   If a table is not managed, some sort of flag should be set at the catalog level, so that if the user later tries to run DROP TABLE ... PURGE, the catalog rejects the operation. I think this would fully solve the issue you describe about "a user creating a Snapshot of another table, then dropping it with purge".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


