Repository: spark
Updated Branches:
  refs/heads/master ac78bcce0 -> 33952cfa8


[SPARK-24675][SQL] Rename table: validate existence of new location

## What changes were proposed in this pull request?
If a table is renamed to a name whose location already exists, the data won't show up after the rename.
```
scala>  Seq("hello").toDF("a").write.format("parquet").saveAsTable("t")

scala> sql("select * from t").show()
+-----+
|    a|
+-----+
|hello|
+-----+

scala> sql("alter table t rename to test")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from test").show()
+---+
|  a|
+---+
+---+
```
The file layout on disk looks like this:
```
$ tree test
test
├── gabage
└── t
    ├── _SUCCESS
    └── part-00000-856b0f10-08f1-42d6-9eb3-7719261f3d5e-c000.snappy.parquet
```
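
Because the destination directory already exists, the filesystem-level rename moves the old table directory `t` into `test` (hence the nested `test/t` above) instead of making it the new table root, so the renamed table's location is left empty. With this patch applied, the same rename fails fast instead; roughly (a sketch of the new behavior, with a placeholder warehouse path):
```
scala> sql("alter table t rename to test")
org.apache.spark.sql.AnalysisException: Can not rename the managed table('`t`').
The associated location('file:/path/to/warehouse/test') already exists.
```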

In Hive, if the new location already exists, the rename fails, even when that location is empty.

We should have the same validation in Spark's `SessionCatalog`, to guard against unexpected bugs.
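
Concretely, the check computes the would-be location of the renamed table under its database directory and aborts if that path already exists. A condensed sketch of the logic added to `SessionCatalog` (the full change is in the diff below):
```
// Condensed from the patch below: before renaming a MANAGED table, verify
// that the destination directory under the database location does not exist.
val oldTable = getTableMetadata(oldName)
if (oldTable.tableType == CatalogTableType.MANAGED) {
  val databaseLocation =
    externalCatalog.getDatabase(oldName.database.getOrElse(currentDb)).locationUri
  val newTableLocation = new Path(new Path(databaseLocation), formatTableName(newName.table))
  val fs = newTableLocation.getFileSystem(hadoopConf)
  if (fs.exists(newTableLocation)) {
    throw new AnalysisException(s"Can not rename the managed table('$oldName'). " +
      s"The associated location('$newTableLocation') already exists.")
  }
}
```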

## How was this patch tested?

New unit test.

Author: Gengliang Wang <gengliang.w...@databricks.com>

Closes #21655 from gengliangwang/validate_rename_table.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33952cfa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33952cfa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33952cfa

Branch: refs/heads/master
Commit: 33952cfa8182c1e925083e18c63c6152dcc3c8b4
Parents: ac78bcc
Author: Gengliang Wang <gengliang.w...@databricks.com>
Authored: Thu Jul 5 09:25:19 2018 -0700
Committer: Xiao Li <gatorsm...@gmail.com>
Committed: Thu Jul 5 09:25:19 2018 -0700

----------------------------------------------------------------------
 docs/sql-programming-guide.md                   |  1 +
 .../sql/catalyst/catalog/SessionCatalog.scala   | 20 ++++++++++++++++++++
 .../spark/sql/execution/command/DDLSuite.scala  | 18 ++++++++++++++++++
 3 files changed, 39 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/33952cfa/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index cd7329b..ad23dae 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1850,6 +1850,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
  - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with an empty schema.
  - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promoting both sides to TIMESTAMP. Setting `spark.sql.hive.compareDateTimestampInTimestamp` to `false` restores the previous behavior. This option will be removed in Spark 3.0.
  - Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
+  - Since Spark 2.4, renaming a managed table to an existing location is not allowed. An exception is thrown when attempting to rename a managed table to an existing location.
  - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
  - Since Spark 2.4, Spark has enabled non-cascading SQL cache invalidation in addition to the traditional cache invalidation mechanism. The non-cascading cache invalidation mechanism allows users to remove a cache without impacting its dependent caches. This new cache invalidation mechanism is used in scenarios where the data of the cache to be removed is still valid, e.g., calling unpersist() on a Dataset, or dropping a temporary view. This allows users to free up memory and keep the desired caches valid at the same time.
  - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their queries unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.

http://git-wip-us.apache.org/repos/asf/spark/blob/33952cfa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index c390337..c26a345 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@@ -619,6 +619,7 @@ class SessionCatalog(
         requireTableExists(TableIdentifier(oldTableName, Some(db)))
         requireTableNotExists(TableIdentifier(newTableName, Some(db)))
         validateName(newTableName)
+        validateNewLocationOfRename(oldName, newName)
         externalCatalog.renameTable(db, oldTableName, newTableName)
       } else {
         if (newName.database.isDefined) {
@@ -1366,4 +1367,23 @@ class SessionCatalog(
     // copy over temporary views
     tempViews.foreach(kv => target.tempViews.put(kv._1, kv._2))
   }
+
+  /**
+   * Validate the new location before renaming a managed table; the location must not already exist.
+   */
+  private def validateNewLocationOfRename(
+      oldName: TableIdentifier,
+      newName: TableIdentifier): Unit = {
+    val oldTable = getTableMetadata(oldName)
+    if (oldTable.tableType == CatalogTableType.MANAGED) {
+      val databaseLocation =
+        externalCatalog.getDatabase(oldName.database.getOrElse(currentDb)).locationUri
+      val newTableLocation = new Path(new Path(databaseLocation), formatTableName(newName.table))
+      val fs = newTableLocation.getFileSystem(hadoopConf)
+      if (fs.exists(newTableLocation)) {
+        throw new AnalysisException(s"Can not rename the managed table('$oldName')" +
+          s". The associated location('$newTableLocation') already exists.")
+      }
+    }
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/33952cfa/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
index 3998cec..270ed7f 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
@@ -441,6 +441,24 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
     }
   }
 
+  test("rename a managed table with existing empty directory") {
+    val tableLoc = new File(spark.sessionState.catalog.defaultTablePath(TableIdentifier("tab2")))
+    try {
+      withTable("tab1") {
+        sql(s"CREATE TABLE tab1 USING $dataSource AS SELECT 1, 'a'")
+        tableLoc.mkdir()
+        val ex = intercept[AnalysisException] {
+          sql("ALTER TABLE tab1 RENAME TO tab2")
+        }.getMessage
+        val expectedMsg = "Can not rename the managed table('`tab1`'). The associated location"
+        assert(ex.contains(expectedMsg))
+      }
+    } finally {
+      waitForTasksToFinish()
+      Utils.deleteRecursively(tableLoc)
+    }
+  }
+
   private def checkSchemaInCreatedDataSourceTable(
       path: File,
       userSpecifiedSchema: Option[String],

