rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805069054
##########
File path:
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident,
TableChange... changes) throws No
@Override
public boolean dropTable(Identifier ident) {
+ return dropTableInternal(ident, false);
+ }
+
+ @Override
+ public boolean purgeTable(Identifier ident) {
+ return dropTableInternal(ident, true);
+ }
+
+ private boolean dropTableInternal(Identifier ident, boolean purge) {
try {
return isPathIdentifier(ident) ?
- tables.dropTable(((PathIdentifier) ident).location()) :
- icebergCatalog.dropTable(buildIdentifier(ident));
+ tables.dropTable(((PathIdentifier) ident).location(), purge) :
Review comment:
@jackye1995's summary of external tables is helpful. Thanks for helping set the
context!
The confusion here usually comes from users who have been taught to expect
certain outcomes from external tables, such as that external tables don't
remove data when they're dropped. Those guarantees are inconsistent in the
first place: you can still overwrite (and thereby remove files) in an external
table in Hive. Besides, why should a user have to decide, at table creation
time, whether a table's data can be deleted when it is eventually dropped?
How to clean up when dropping a table depends on choices made by the data
platform. At Netflix, we removed references only and deleted data files only
after a waiting period (to allow restoring a table). Users had no say in how
the data was removed, only that the table was logically dropped. Both
`EXTERNAL` and `PURGE` make little sense with policies like that.
> I think an extension of registerTable is needed for Iceberg to complete
the notion of external table. And it should be the catalog's responsibility to
manage this semantics, maybe set a flag to distinguish external table from
normal tables to make it clear that it's a read-only copy of a managed table.
I really don't think that the concept of `EXTERNAL` has a place in Iceberg
and I am very skeptical that we should add it. `EXTERNAL` essentially means
that the table is owned by some other system. But any engine can use an Iceberg
catalog and the library to update a table. What is the utility of arbitrarily
restricting that?
As Jack said, we can think of the relationship differently, removing the
engine requirement. But if we still have `EXTERNAL` in this world, then it is
really just to support secondary references. I don't know why you would want
secondary references that look exactly like primary references but are
restricted from making changes. Just connect to the catalog. And if there's a
need, we can add a catalog flag that makes it read-only.
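If it helps to make the idea concrete, a read-only catalog flag could be a simple wrapper that passes reads through and rejects writes. This is a hypothetical sketch: `MiniCatalog` and `ReadOnlyCatalog` are invented for illustration and are not Iceberg's actual `Catalog` API.

```java
// Hypothetical sketch of a read-only catalog flag. MiniCatalog and
// ReadOnlyCatalog are invented names, not Iceberg's Catalog API.
interface MiniCatalog {
  String loadTable(String name);   // read operation
  boolean dropTable(String name);  // write operation
}

class ReadOnlyCatalog implements MiniCatalog {
  private final MiniCatalog delegate;

  ReadOnlyCatalog(MiniCatalog delegate) {
    this.delegate = delegate;
  }

  @Override
  public String loadTable(String name) {
    // reads pass through to the underlying catalog
    return delegate.loadTable(name);
  }

  @Override
  public boolean dropTable(String name) {
    // writes are rejected: the secondary reference cannot make changes
    throw new UnsupportedOperationException(
        "Catalog is read-only: cannot drop " + name);
  }
}
```

The point is that the restriction lives in the catalog configuration, not in a per-table `EXTERNAL` flag chosen at create time.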
That still leaves an open question of what to do for `gc.enabled`. There are
times when a table doesn't "own" the resources within it and `gc.enabled` is a
quick way to account for that -- to basically turn off operations that may
delete files that aren't owned. I think that includes PURGE, but I do see the
distinction and agree that it is an ugly choice to need to make. Short term, I
think we should disable PURGE and throw an exception if `gc.enabled` is false.
Longer term, I think we need to refine the idea of file ownership. The best
way that I can think of is to do this by prefix, as we discussed on a thread in
the Trino community a while back and in the Iceberg sync. In this model, a
table gets its own prefix and can delete anything in that prefix (or more
likely, a small set of prefixes). Imported data files are unowned and are not
deleted by expire snapshots or purge if they are outside of owned prefixes.
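The prefix-ownership model could be sketched roughly like this, assuming a table carries a small list of owned location prefixes and maintenance operations filter deletions against it. All names here are invented for illustration; this is not an existing Iceberg API.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of prefix-based file ownership: a table owns a small set
// of location prefixes, and expire-snapshots or purge may delete only files
// under an owned prefix. Imported files outside those prefixes are unowned
// and left alone.
class OwnedPrefixes {
  private final List<String> prefixes;

  OwnedPrefixes(List<String> prefixes) {
    this.prefixes = prefixes;
  }

  // true if table maintenance is allowed to delete this file
  boolean owns(String filePath) {
    return prefixes.stream().anyMatch(filePath::startsWith);
  }

  // keep only the candidate files this table is allowed to delete
  List<String> deletableFiles(List<String> candidates) {
    return candidates.stream().filter(this::owns).collect(Collectors.toList());
  }
}
```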
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]