cloud-fan commented on code in PR #53202:
URL: https://github.com/apache/spark/pull/53202#discussion_r2558784281
##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSuite.scala:
##########
@@ -1707,6 +1753,33 @@ class DataSourceV2DataFrameSuite
}
}
+ test("SPARK-54444: any schema changes after analysis are prohibited in
commands") {
+ val s = "testcat.ns1.s"
+ val t = "testcat.ns1.t"
+ withTable(s, t) {
+ sql(s"CREATE TABLE $s (id bigint, data string) USING foo")
+ sql(s"INSERT INTO $s VALUES (1, 'a'), (2, 'b')")
+
+ // create source DataFrame without executing it
+ val sourceDF = spark.table(s)
+
+ // derive another DataFrame from pre-analyzed source
+ val filteredSourceDF = sourceDF.filter("id < 10")
+
+ // add column
+ sql(s"ALTER TABLE $s ADD COLUMN dep STRING")
+
+ // insert more data into source table
+ sql(s"INSERT INTO $s VALUES (3, 'c', 'finance')")
+
+ // CTAS should fail as commands must operate on current schema
Review Comment:
I have a different opinion on this. The motivation of the DS v2 table version
refresh is to work around the eager analysis behavior of Spark Classic: an
analyzed DataFrame can be referenced later to construct new DataFrames, and we
don't want that analyzed DataFrame to stay pinned to an old table version.
However, commands are different. Spark always uses a dedicated
`QueryExecution` to eagerly execute commands, so there is no point in
refreshing. We can let a command pin the table versions after it is analyzed,
and there is nothing wrong with that.
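
A minimal sketch of the distinction being described, assuming a Spark session `spark` with the same `testcat` DS v2 catalog used in the test above; the version-pinning mechanics are simplified for illustration and are not the actual implementation:

```scala
// Derived DataFrames: in Spark Classic, spark.table() is analyzed eagerly,
// but new DataFrames can be built from it much later, so the table version
// should be refreshed rather than frozen at the original analysis time.
val sourceDF = spark.table("testcat.ns1.s")   // analyzed now
val derivedDF = sourceDF.filter("id < 10")    // composed later from the analyzed plan
derivedDF.collect()                           // should reflect the table as of execution

// Commands: analyzed and executed eagerly inside one dedicated QueryExecution,
// so pinning the table version observed at analysis time is harmless.
spark.sql("CREATE TABLE testcat.ns1.t USING foo AS SELECT * FROM testcat.ns1.s")
```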