MaxGekk commented on a change in pull request #31101:
URL: https://github.com/apache/spark/pull/31101#discussion_r554466741
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala
##########
@@ -41,6 +45,33 @@ trait AlterTableAddPartitionSuiteBase extends command.AlterTableAddPartitionSuit
"The spec ([p1=]) contains an empty partition column value"))
}
}
+
+ test("SPARK-34055: refresh cache in partition adding") {
+ withTable("t") {
+      sql(s"CREATE TABLE t (id int, part int) USING parquet PARTITIONED BY (part)")
+ sql("INSERT INTO t PARTITION (part=0) SELECT 0")
+ assert(!spark.catalog.isCached("t"))
+ sql("CACHE TABLE t")
+ assert(spark.catalog.isCached("t"))
+ checkAnswer(sql("SELECT * FROM t"), Seq(Row(0, 0)))
+
+ // Create new partition (part = 1) in the filesystem
+      val information = sql("SHOW TABLE EXTENDED LIKE 't' PARTITION (part = 0)")
Review comment:
Sure, as soon as this can be used in one more place, we will move this
to the common trait.
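
For reference, one possible shape of such a shared helper in the common trait could look like the sketch below. This is only an illustration of the refactoring being discussed; the method name and parsing of the `SHOW TABLE EXTENDED` output are hypothetical, not part of this PR:

```scala
// Hypothetical helper for AlterTableAddPartitionSuiteBase, extracted once a
// second test needs it: returns the `Location:` field from the `information`
// column of SHOW TABLE EXTENDED for the given partition spec.
protected def getPartitionLocation(table: String, partSpec: String): String = {
  val information = sql(s"SHOW TABLE EXTENDED LIKE '$table' PARTITION ($partSpec)")
    .select("information")
    .first().getString(0)
  information
    .split("\\r?\\n")
    .filter(_.startsWith("Location:"))
    .head
    .stripPrefix("Location:")
    .trim
}
```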
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -485,6 +485,7 @@ case class AlterTableAddPartitionCommand(
       catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
}
+ sparkSession.catalog.refreshTable(table.identifier.quotedString)
Review comment:
   In this particular case, it doesn't matter because the table size is re-calculated by getting file statuses directly from the filesystem, so the cached data is not used when updating table stats. I think we should review other places.
   Just in case, I will update the test and check that the table size is updated after adding the partition.
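
   A rough sketch of what such a size check could look like, reading the statistics back through the optimized plan (the exact assertions are an illustration, not the test added by this PR):

```scala
// Illustration only: verify that the recorded table size grows after a new
// partition is added and populated.
sql("CREATE TABLE t (id int, part int) USING parquet PARTITIONED BY (part)")
sql("INSERT INTO t PARTITION (part = 0) SELECT 0")
sql("ANALYZE TABLE t COMPUTE STATISTICS")
val sizeBefore = spark.table("t").queryExecution.optimizedPlan.stats.sizeInBytes

sql("ALTER TABLE t ADD PARTITION (part = 1)")
sql("INSERT INTO t PARTITION (part = 1) SELECT 1")
sql("ANALYZE TABLE t COMPUTE STATISTICS")
val sizeAfter = spark.table("t").queryExecution.optimizedPlan.stats.sizeInBytes

assert(sizeAfter > sizeBefore)
```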
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##########
@@ -485,6 +485,7 @@ case class AlterTableAddPartitionCommand(
       catalog.createPartitions(table.identifier, batch, ignoreIfExists = ifNotExists)
}
+ sparkSession.catalog.refreshTable(table.identifier.quotedString)
Review comment:
   > Just in case, I will update the test and check that the table size is updated after adding the partition.

   Let me add such a check later, independently of this PR. I think I have found one more issue related to looking up `HiveTableRelation` in the Cache Manager's cache. It seems the clearing of table stats done in https://github.com/apache/spark/pull/30995 is not enough.
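
   The cache lookup in question can be observed through Spark's internal `CacheManager`; a minimal sketch of the kind of check involved, assuming a test that can reach `spark.sharedState.cacheManager` (an internal API, used here only to illustrate the lookup):

```scala
// Sketch: after CACHE TABLE, the table's plan should be found by the
// Cache Manager. For Hive tables the cached plan is built around a
// HiveTableRelation, which is what the lookup has to match; if the
// relation in the cache and the fresh plan differ, the lookup misses.
sql("CACHE TABLE t")
val cached = spark.sharedState.cacheManager.lookupCachedData(spark.table("t"))
assert(cached.isDefined)
```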
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]