[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

GitBox Sun, 13 Feb 2022 00:02:20 -0800


jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805296518




##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       Thank you Ryan very much for the insights!
   
   For the discussion related to GC, I am fine with disallowing purge when 
`gc.enabled` is set to false as a short-term solution, if that is how we would 
like to define garbage collection. We should enforce it within the catalog 
implementations so that the behavior is consistent across engines.
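   
   To make the proposal concrete, here is a minimal sketch of such a guard. It 
is a toy stand-in, not the Iceberg catalog API: the class `PurgeGuardSketch` 
and its `dropTable` signature are hypothetical, and it assumes `gc.enabled` 
defaults to true when the property is absent.

```java
import java.util.Map;

// Hypothetical stand-in for a catalog implementation; not the Iceberg API.
class PurgeGuardSketch {
  // Table property discussed above; assumed to default to true when unset.
  static final String GC_ENABLED = "gc.enabled";

  /** Returns true if the table entry was dropped; rejects purge when GC is disabled. */
  static boolean dropTable(Map<String, String> tableProperties, boolean purge) {
    boolean gcEnabled =
        Boolean.parseBoolean(tableProperties.getOrDefault(GC_ENABLED, "true"));
    if (purge && !gcEnabled) {
      throw new IllegalStateException("Cannot purge table: gc.enabled is false");
    }
    // A real catalog would remove the catalog entry here and, when purge is
    // set, also delete the table's data and metadata files.
    return true;
  }

  public static void main(String[] args) {
    // Purge is allowed when gc.enabled is unset.
    System.out.println(dropTable(Map.of(), true));
    // A plain drop is still allowed when gc.enabled is false.
    System.out.println(dropTable(Map.of(GC_ENABLED, "false"), false));
    // Purge is rejected when gc.enabled is false.
    try {
      dropTable(Map.of(GC_ENABLED, "false"), true);
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

   Putting the check in the catalog rather than in `SparkCatalog` means every 
engine that calls the catalog gets the same behavior.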
   
   > I really don't think that the concept of EXTERNAL has a place in Iceberg 
and I am very skeptical that we should add it.
   
   100% agree. However, we have now opened the door by adding `registerTable` 
to the catalog, which maps perfectly to the external table concept. I have 
already received a few feature requests asking for it to map to CREATE EXTERNAL 
TABLE. People can now register an Iceberg table with an old metadata file 
location and write to it, creating two diverged metadata histories of the same 
table. This is a very dangerous action because two Iceberg tables can then own 
the same set of files and corrupt each other.
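   
   The divergence can be illustrated with a toy model. This is only a sketch: 
the maps, method names, and file names below are hypothetical and do not 
reflect how Iceberg stores metadata. It only shows two catalog entries that 
start from the same metadata file and then commit independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names and structure are hypothetical, not the Iceberg API.
class DivergedHistorySketch {
  // Metadata file location -> data files referenced by that metadata version.
  static final Map<String, List<String>> metadataFiles = new HashMap<>();
  // Catalog entry name -> current metadata file location.
  static final Map<String, String> catalog = new HashMap<>();

  static void registerTable(String name, String metadataLocation) {
    catalog.put(name, metadataLocation);
  }

  // Appends a data file by writing a new metadata version and swapping the pointer.
  static void commitAppend(String name, String newMetadataLocation, String dataFile) {
    List<String> files = new ArrayList<>(metadataFiles.get(catalog.get(name)));
    files.add(dataFile);
    metadataFiles.put(newMetadataLocation, files);
    catalog.put(name, newMetadataLocation);
  }

  static List<String> currentFiles(String name) {
    return metadataFiles.get(catalog.get(name));
  }

  public static void main(String[] args) {
    metadataFiles.put("v1.metadata.json", List.of("a.parquet"));
    registerTable("db.t", "v1.metadata.json");
    commitAppend("db.t", "v2.metadata.json", "b.parquet");

    // A second entry is registered against the old metadata file and written to.
    registerTable("db.t_copy", "v1.metadata.json");
    commitAppend("db.t_copy", "v2-copy.metadata.json", "c.parquet");

    // Both histories now claim a.parquet, but neither knows about the other's commit.
    System.out.println(currentFiles("db.t"));      // [a.parquet, b.parquet]
    System.out.println(currentFiles("db.t_copy")); // [a.parquet, c.parquet]
  }
}
```

   If either entry later deletes `a.parquet` (for example through a purge or 
file cleanup), the other entry is left pointing at a missing file.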
   
   As a first step, we should make it clear in the javadoc of `registerTable` 
that it should only be used to recover a table. Creating two references in the 
catalog to the same table metadata and writing through both, which produces 
diverged metadata histories, is not recommended and will have unintended side 
effects.
   
   When I suggested adding the `EXTERNAL` concept to `registerTable`, to be 
honest I was really trying to make peace with people who want to stick with 
this definition. At least we could have a read-only solution and encourage 
registering a normal table only for recovery. But the more I think about this 
topic, the more I feel we should start promoting the right way to operate on 
Iceberg tables.
   
   Just from a correctness perspective, this is the wrong thing to promote. 
Even if people only want to read a historical metadata file, information such 
as table properties is stale. It is always better to time travel against the 
latest metadata, or to query a tagged snapshot recorded in the latest metadata, 
and have the metadata location automatically updated with new commits.
   
   As you said, `EXTERNAL` also just arbitrarily limits how people can use 
Iceberg. I would say its only value is at the business level. For compute 
vendors or data platforms that have the traditional definition of external and 
managed tables, it is much easier to provide external table support for an 
alien product than the full table operation support that is usually offered 
only to managed tables. My team debated this for a long time before we decided 
to go with full support, and we tried really hard to explain the entire 
picture, but I doubt other people could do the same. We can already see some 
vendors offering Iceberg support under the name of "external table".
   
   In the long term, we should promote a new table ownership model (maybe call 
it a `SHARED` model) and bring people up to date with how Iceberg tables are 
operated. Let me draft a doc for that so we can have a formal discussion. I 
will also include concepts like table root location ownership in that doc so we 
have full clarity in the domain of table ownership.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


