MaxGekk commented on a change in pull request #30979:
URL: https://github.com/apache/spark/pull/30979#discussion_r550325156
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableDropPartitionSuite.scala
##########
@@ -42,6 +42,18 @@ trait AlterTableDropPartitionSuiteBase extends command.AlterTableDropPartitionSu
checkPartitions(t) // no partitions
}
}
+
+ test("SPARK-33941: invalidate cache after partition dropping") {
+ withNamespaceAndTable("ns", "tbl") { t =>
+ sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
+ sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
+ val df = spark.table(t)
+ df.cache()
+ assert(!df.isEmpty)
+ sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
+ assert(df.isEmpty)
Review comment:
I still use `df.isEmpty` instead of checking the content of the
dataframe because `sparkSession.catalog.refreshTable` does not seem to be
enough: it doesn't refresh the file index for the table.
So, the following test fails:
```
test("SPARK-33941: refresh cache after partition dropping") {
  withNamespaceAndTable("ns", "tbl") { t =>
    sql(s"CREATE TABLE $t (id int, part int) $defaultUsing PARTITIONED BY (part)")
    sql(s"INSERT INTO $t PARTITION (part=0) SELECT 0")
    sql(s"INSERT INTO $t PARTITION (part=1) SELECT 1")
    val df = spark.table(t)
    df.cache()
    df.collect()
    sql(s"ALTER TABLE $t DROP PARTITION (part=0)")
    df.collect()
  }
}
```
with the exception:
```
java.io.FileNotFoundException: File file:/Users/maximgekk/proj/drop-partition-invalidate-cache/spark-warehouse/org.apache.spark.sql.execution.command.v1.AlterTableDropPartitionSuite/ns.db/tbl/part=0/part-00000-4536ff40-a1ac-41f4-9ff5-7fbc7d67c7b9.c000.snappy.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
```
And running `REFRESH TABLE` after `ALTER TABLE .. DROP PARTITION` doesn't help
either, since it doesn't refresh the file index :-(
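For illustration only (not Spark code, and not part of this PR): the failure mode above can be sketched with a plain-JDK analogy. A planned scan keeps a cached listing of the table's files; deleting a file behind the cache's back leaves the listing pointing at a path that no longer exists, which is exactly what `FileScanRDD` trips over. The object and variable names here are made up for the sketch.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical demo object: a cached directory listing (standing in for
// Spark's file index) goes stale once a "partition" file is deleted.
object StaleListingDemo extends App {
  val dir = Files.createTempDirectory("parts")
  val part0 = Files.createFile(dir.resolve("part=0"))

  // "Plan" the scan once and cache the listing, as a file index would.
  val cachedListing: List[Path] = Files.list(dir).iterator().asScala.toList

  // Drop the partition file behind the cache's back,
  // like ALTER TABLE .. DROP PARTITION does.
  Files.delete(part0)

  // The cached listing still names the deleted file...
  assert(cachedListing.contains(part0))
  // ...even though it is gone from disk, so any read through the cached
  // listing would fail, mirroring the FileNotFoundException above.
  assert(!Files.exists(part0))
}
```

The fix direction in the PR follows from this: the cached plan's file index has to be invalidated (not just the catalog metadata refreshed) for the cached dataframe to see the dropped partition.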
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]