[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

GitBox Sun, 13 Feb 2022 18:32:44 -0800


rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805401552




##########
File path: 
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again

Review comment:
       This doesn't sound correct to me. You should not need to alter the 
catalog to check whether a table is purged. Can't you get the location from the 
table itself? And this appears to make the tests specific to the Hadoop catalog.
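
To illustrate the suggestion above, here is a minimal, self-contained sketch of a catalog-agnostic check: read the location from the loaded table before the drop, then inspect that location afterwards. `ToyCatalog`, `ToyTable`, and `filesSurviveDrop` are hypothetical stand-ins written only for this example; they are not Iceberg or Spark classes. A real test would presumably load the table through whatever catalog the test already uses rather than through a toy like this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Toy stand-ins only (assumption): none of these are Iceberg or Spark classes.
class DropTableSketch {

  /** Hypothetical table handle that knows its own root location. */
  static final class ToyTable {
    private final Path location;

    ToyTable(Path location) {
      this.location = location;
    }

    Path location() {
      return location;
    }
  }

  /** Hypothetical catalog that maps names to tables stored in temp directories. */
  static final class ToyCatalog {
    private final Map<String, ToyTable> tables = new HashMap<>();

    ToyTable createTable(String name) throws IOException {
      Path dir = Files.createTempDirectory(name);
      Files.write(dir.resolve("data-file"), new byte[] {1, 2, 3});
      ToyTable table = new ToyTable(dir);
      tables.put(name, table);
      return table;
    }

    ToyTable loadTable(String name) {
      ToyTable table = tables.get(name);
      if (table == null) {
        throw new IllegalArgumentException("No such table: " + name);
      }
      return table;
    }

    /** Always removes the catalog entry; deletes the files only when purge is true. */
    boolean dropTable(String name, boolean purge) throws IOException {
      ToyTable table = tables.remove(name);
      if (table == null) {
        return false;
      }
      if (purge) {
        deleteRecursively(table.location());
      }
      return true;
    }
  }

  /** Deletes children before parents by visiting paths in reverse order. */
  static void deleteRecursively(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(path -> {
        try {
          Files.delete(path);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  /** Returns whether the table's data file still exists after the drop. */
  static boolean filesSurviveDrop(boolean purge) {
    try {
      ToyCatalog catalog = new ToyCatalog();
      catalog.createTable("t");
      // Capture the location from the table itself, before dropping it.
      Path location = catalog.loadTable("t").location();
      catalog.dropTable("t", purge);
      boolean survived = Files.exists(location.resolve("data-file"));
      if (survived) {
        deleteRecursively(location); // clean up the temp directory
      }
      return survived;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the location comes from the table handle, the check itself does not depend on how a particular catalog lays out its warehouse directory.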

##########
File path: 
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       > 100% Agree. However we now opened the door by adding registerTable in 
catalog, which maps to the external table concept perfectly. I already received 
a few feature requests of people asking for this to map to CREATE EXTERNAL 
TABLE. People can now register an Iceberg table with an old metadata file 
location and do writes against it to create basically 2 diverged metadata 
history of the same table. This is very dangerous action because 2 Iceberg 
tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a 
table with a catalog, after which the catalog owns it like any other table. 
There should be nothing that suggests registration has anything to do with 
`EXTERNAL` and no reason for people to think that tables that are added to a 
catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from 
the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model 
(maybe call it a SHARED model) and start to bring people up to date with how 
Iceberg tables are operated. Let me draft a doc for that to have a formal 
discussion, and also include concepts like table root location ownership in 
that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there 
are times when the table is not shared and we would get into similar trouble. 
But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want 
to keep them separate in design docs or proposals. I wouldn't want to confuse 
table modification with data file ownership, although modification does have 
implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that 
alleviates our concern on accidental drops on external tables. However, it also 
means that drop table on managed tables would leave data around, which is also 
an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these 
decisions, ideally. Users interact with logical tables; physical concerns are 
for the platform. If you don't have a platform-level plan for dropping table 
data, then I think the `PURGE` approach is okay because a user presumably makes 
the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping 
tables and data. Maybe you can have a platform-supported way to un-delete, but 
when you drop a table you should generally expect that you did something 
destructive!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


