MaxGekk commented on a change in pull request #30979:
URL: https://github.com/apache/spark/pull/30979#discussion_r550325156
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableDropPartitionSuite.scala
##########
@@ -42,6 +42,18 @@ trait AlterTableDropPartitionSuiteBase extends command.AlterTableDropPartitionSu
checkPartitions(t) // no partitions
}
}
+
+ test("SPARK-33941: invalidate cache after partition dropping") {
+ withNamespaceAndTable("ns", "tbl") { t =>
+ sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
+ sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+ val df = spark.table(t)
+ df.cache()
+ assert(!df.isEmpty)
+ sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
+ assert(df.isEmpty)
Review comment:
I still use `df.isEmpty` instead of checking the content of the
dataframe because `sparkSession.catalog.refreshTable` does not seem to be
enough: it doesn't refresh the file index for the table.
So, the following test fails:
```
test("SPARK-33941: refresh cache after partition dropping") {
  withNamespaceAndTable("ns", "tbl") { t =>
    sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
    sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
    sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
    val df = spark.table(t)
    df.cache()
    df.collect()
    sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
    df.collect()
  }
}
```
with the exception:
```
java.io.FileNotFoundException: File file:/Users/maximgekk/proj/drop-partition-invalidate-cache/spark-warehouse/org.apache.spark.sql.execution.command.v1.AlterTableDropPartitionSuite/ns.db/tbl/part=0/part-00000-4536ff40-a1ac-41f4-9ff5-7fbc7d67c7b9.c000.snappy.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
```
And running `REFRESH TABLE` after `ALTER TABLE .. DROP PARTITION` doesn't help
either, since it doesn't refresh the file index :-(
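For illustration only (not Spark code, and not part of this PR): the failure mode above can be sketched with a plain-JDK analogy. A planned scan keeps a cached listing of the table's files; deleting a file behind the cache's back leaves the listing pointing at a path that no longer exists, which is exactly what `FileScanRDD` trips over. The object and variable names here are made up for the sketch.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical demo object: a cached directory listing (standing in for
// Spark's file index) goes stale once a "partition" file is deleted.
object StaleListingDemo extends App {
  val dir = Files.createTempDirectory("parts")
  val part0 = Files.createFile(dir.resolve("part=0"))

  // "Plan" the scan once and cache the listing, as a file index would.
  val cachedListing: List[Path] = Files.list(dir).iterator().asScala.toList

  // Drop the partition file behind the cache's back,
  // like ALTER TABLE .. DROP PARTITION does.
  Files.delete(part0)

  // The cached listing still names the deleted file...
  assert(cachedListing.contains(part0))
  // ...even though it is gone from disk, so any read through the cached
  // listing would fail, mirroring the FileNotFoundException above.
  assert(!Files.exists(part0))
}
```

The fix direction in the PR follows from this: the cached plan's file index has to be invalidated (not just the catalog metadata refreshed) for the cached dataframe to see the dropped partition.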
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]