juliuszsompolski opened a new pull request, #53173:
URL: https://github.com/apache/spark/pull/53173

   ### What changes were proposed in this pull request?
   
   Make DataFrameWriter saveAsTable add a write option `isDataFrameWriterV1 = true` when using Overwrite mode with a `delta` data source.
   This is an emergency fix to prevent a breaking change that would result in data corruption with Delta data sources in Spark 4.1.
   
   ### Why are the changes needed?
   
   Spark's SaveMode.Overwrite is documented as:
   
   ```
        * if data/table already exists, existing data is expected to be overwritten
        * by the contents of the DataFrame.
   ```
   It does not define the behaviour for overwriting the table metadata (schema, etc.). The Delta data source's interpretation of this DataFrameWriter V1 documentation is to not replace the table schema unless the Delta-specific option "overwriteSchema" is set to true.
   
   However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is 
the same as the plan of DataFrameWriterV2 createOrReplace API, which is 
documented as:
   
   ```
       * The output table's schema, partition layout, properties, and other configuration
       * will be based on the contents of the data frame and the configuration set on this
       * writer. If the table exists, its configuration and data will be replaced.
   ```
   
   Therefore, for calls via the DataFrameWriter V2 createOrReplace API, the metadata always needs to be replaced, and the Delta data source does not consult the overwriteSchema option.
   Since the created plan is exactly the same in both cases, Delta had to resort to a very ugly hack: inspecting the stack trace of the call to detect which API the call came from.
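The kind of stack-trace inspection described above can be sketched like this (a hypothetical illustration of the fragile approach, not Delta's actual code):

```scala
// Hypothetical sketch of the fragile stack-trace detection: walk the call
// stack and decide that the plan came from DataFrameWriter V1 if a
// DataFrameWriter.saveAsTable frame is present. This breaks as soon as
// planning and execution no longer share a call stack.
object StackTraceDetectionSketch {
  def calledFromWriterV1(stack: Seq[StackTraceElement]): Boolean =
    stack.exists { frame =>
      frame.getClassName.endsWith("DataFrameWriter") &&
        frame.getMethodName == "saveAsTable"
    }
}
```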
   
   In Spark 4.1 in Connect mode, this stopped working because planning and execution of the commands are decoupled, and the stack trace no longer contains the point where the plan was created.
   
   To retain compatibility of the Delta data source with Spark 4.1 in Connect mode, Spark provides this explicit write option to indicate to the Delta data source that the call is coming from DataFrameWriter V1.
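With the explicit option in place, the data-source side can decide whether to replace the schema with a plain option lookup instead of stack inspection. A hypothetical sketch combining the two documented behaviors (`replaceSchema` is illustrative; only the option name `isDataFrameWriterV1` comes from this PR):

```scala
// Hypothetical sketch: a V2 createOrReplace call always replaces the table
// metadata, while a V1 saveAsTable call (flagged by the explicit
// isDataFrameWriterV1 write option) only replaces the schema when the
// Delta-specific overwriteSchema option is also set.
object ReplaceDecisionSketch {
  def replaceSchema(writeOptions: Map[String, String]): Boolean = {
    val fromV1 =
      writeOptions.get("isDataFrameWriterV1").exists(_.equalsIgnoreCase("true"))
    if (fromV1)
      writeOptions.get("overwriteSchema").exists(_.equalsIgnoreCase("true"))
    else
      true // V2 createOrReplace: metadata is always replaced
  }
}
```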
   
   Followup: Since the documented semantics of Spark's DataFrameWriter V1 saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should not reuse the exact same logical plan for these APIs.
   Existing data sources that have been implemented following Spark's documentation of these APIs should have a way to differentiate between them.
   
   However, at this point this is an emergency fix: releasing Spark 4.1 as is would cause data corruption with Delta in DataFrameWriter saveAsTable in Overwrite mode, because Delta would not correctly interpret its overwriteSchema option.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   It has been tested with tests that are not part of this PR. Proper testing in Connect mode requires changes on both the Spark and Delta sides; integrating those tests will be done as follow-up work.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Assisted by Claude Code.
   Generated-by: claude code, model sonnet 4.5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
