HyukjinKwon commented on code in PR #53173:
URL: https://github.com/apache/spark/pull/53173#discussion_r2554423999
##########
sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala:
##########
@@ -484,12 +484,50 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) extends sql.DataFram
serde = None,
external = false,
constraints = Seq.empty)
+ val writeOptions = if (source == "delta") {
+   // This is a very special workaround for Delta Lake.
+   // Spark's SaveMode.Overwrite is documented as:
+   //   * if data/table already exists, existing data is expected to be overwritten
+   //   * by the contents of the DataFrame.
+   // It does not define the behaviour of overwriting the table metadata (schema, etc.).
+   // The Delta datasource's interpretation of this DataFrameWriter V1 API documentation
+   // is to not replace the table schema, unless the Delta-specific option
+   // "overwriteSchema" is set to true.
+   //
+   // However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the same
+   // as the plan of the DataFrameWriterV2 createOrReplace API, which is documented as:
+   //   * The output table's schema, partition layout, properties, and other configuration
+   //   * will be based on the contents of the data frame and the configuration set on this
+   //   * writer. If the table exists, its configuration and data will be replaced.
+   // Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata always
+   // needs to be replaced, and the Delta datasource doesn't use the overwriteSchema
+   // option.
+   //
+   // Since the created plan is exactly the same, Delta had used a very ugly hack to
+   // detect where the API call is coming from, based on the stack trace of the call.
+   //
+   // In Spark 4.1 in Connect mode, this stopped working because planning and execution
+   // of the commands are decoupled, and the stack trace no longer contains the point
+   // where the plan was created.
+   //
+   // To retain compatibility of the Delta datasource with Spark 4.1 in Connect mode,
+   // Spark provides this explicit storage option to indicate to the Delta datasource
+   // that this call is coming from DataFrameWriter V1.
+   //
Review Comment:
Per:
```
// FIXME: Since the details of the documented semantics of Spark's DataFrameWriter V1
// saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should
// not be reusing the exact same logical plan for these APIs.
// Existing datasources which have been implemented following Spark's documentation of
// these APIs should have a way to differentiate between these APIs.
```
Why don't we just always append the option? The downstream datasources that care about this behaviour will make the change accordingly.
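
For illustration, here is a minimal sketch of the two user-facing calls that currently collapse into the same `ReplaceTableAsSelect` plan, and how a downstream datasource could branch on such a flag if it were always appended. The option key `dataFrameWriterV1` and the helper names below are hypothetical, not from this PR:

```scala
import org.apache.spark.sql.DataFrame

object WriterV1VsV2Sketch {
  // Hypothetical option key; the actual key introduced by this PR may differ.
  val V1Flag = "dataFrameWriterV1"

  def writeBothWays(df: DataFrame): Unit = {
    // DataFrameWriter V1: SaveMode.Overwrite is documented to replace the
    // existing data, but says nothing about replacing the table schema.
    df.write.format("delta").mode("overwrite").saveAsTable("t")

    // DataFrameWriterV2: createOrReplace is documented to replace the table's
    // schema, properties, and data. Today both calls produce the same
    // ReplaceTableAsSelect plan, which is what forced Delta's stack-trace hack.
    df.writeTo("t").using("delta").createOrReplace()
  }

  // If Spark always appended the flag, a source that cares (like Delta) could
  // branch on it, while sources that ignore unknown options stay unaffected.
  def shouldReplaceSchema(options: Map[String, String]): Boolean = {
    val fromV1 = options.getOrElse(V1Flag, "false").toBoolean
    !fromV1 || options.getOrElse("overwriteSchema", "false").toBoolean
  }
}
```

Sources that ignore unknown options would see no behaviour change, which is the argument for appending the option unconditionally rather than special-casing `source == "delta"`.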
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.