cloud-fan commented on code in PR #53202:
URL: https://github.com/apache/spark/pull/53202#discussion_r2558784281
##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSuite.scala:
##########
@@ -1707,6 +1753,33 @@ class DataSourceV2DataFrameSuite
}
}
+ test("SPARK-54444: any schema changes after analysis are prohibited in
commands") {
+ val s = "testcat.ns1.s"
+ val t = "testcat.ns1.t"
+ withTable(s, t) {
+ sql(s"CREATE TABLE $s (id bigint, data string) USING foo")
+ sql(s"INSERT INTO $s VALUES (1, 'a'), (2, 'b')")
+
+ // create source DataFrame without executing it
+ val sourceDF = spark.table(s)
+
+ // derive another DataFrame from pre-analyzed source
+ val filteredSourceDF = sourceDF.filter("id < 10")
+
+ // add column
+ sql(s"ALTER TABLE $s ADD COLUMN dep STRING")
+
+ // insert more data into source table
+ sql(s"INSERT INTO $s VALUES (3, 'c', 'finance')")
+
+ // CTAS should fail as commands must operate on current schema
Review Comment:
I have a different opinion on this. The motivation of the DS v2 table version
refresh is to work around the eager analysis behavior of Spark Classic: an
analyzed DataFrame can be referenced later to construct new DataFrames, and we
don't want that analyzed DataFrame to stay pinned to an old table version.
However, commands are different. Spark always uses a dedicated
`QueryExecution` to eagerly execute commands, so there is no point in
refreshing. We can let a command pin the table versions after it is analyzed,
and there is nothing wrong with that.
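
A minimal sketch of the distinction being described, assuming a Spark session `spark` with the same `testcat` DS v2 catalog used in the test above; the version-pinning mechanics are simplified for illustration and are not the actual implementation:

```scala
// Derived DataFrames: in Spark Classic, spark.table() is analyzed eagerly,
// but new DataFrames can be built from it much later, so the table version
// should be refreshed rather than frozen at the original analysis time.
val sourceDF = spark.table("testcat.ns1.s")   // analyzed now
val derivedDF = sourceDF.filter("id < 10")    // composed later from the analyzed plan
derivedDF.collect()                           // should reflect the table as of execution

// Commands: analyzed and executed eagerly inside one dedicated QueryExecution,
// so pinning the table version observed at analysis time is harmless.
spark.sql("CREATE TABLE testcat.ns1.t USING foo AS SELECT * FROM testcat.ns1.s")
```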