juliuszsompolski opened a new pull request, #53215:
URL: https://github.com/apache/spark/pull/53215

   ### What changes were proposed in this pull request?
   
   This PR adds a new marker interface `RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption` that data sources can implement to distinguish between DataFrameWriter V1 `saveAsTable` with `SaveMode.Overwrite` and DataFrameWriter V2 `createOrReplace`/`replace` operations.
   
   * Added `RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption` interface in the catalog package (an opt-in sketch is shown after this list)
   * Modified `DataFrameWriter.saveAsTable` to add an internal write option (`__isDataFrameWriterV1SaveAsTableOverwrite__=true`) when the provider implements this interface and the mode is `Overwrite`
   * Added tests verifying the option is only added for providers that opt in
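
   A minimal sketch of how a V2 data source could opt in, assuming the interface lives in `org.apache.spark.sql.connector.catalog` (the description above only says "the catalog package"); the provider class name and the elided (`???`) method bodies are placeholders, not actual Delta code:

   ```scala
   // Sketch only: a TableProvider opting in to the new marker interface.
   import org.apache.spark.sql.connector.catalog.{RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption, Table, TableProvider}
   import org.apache.spark.sql.connector.expressions.Transform
   import org.apache.spark.sql.types.StructType
   import org.apache.spark.sql.util.CaseInsensitiveStringMap

   class MyProvider
     extends TableProvider
     with RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption {

     override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

     override def getTable(
         schema: StructType,
         partitioning: Array[Transform],
         properties: java.util.Map[String, String]): Table = ???
   }
   ```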
   
   ### Why are the changes needed?
   
   Spark's SaveMode.Overwrite is documented as:
   
   ```
           * if data/table already exists, existing data is expected to be 
overwritten
           * by the contents of the DataFrame.
   ```
   It does not define whether the table metadata (schema, etc.) is also overwritten.
   
   However, DataFrameWriter V1 `saveAsTable` with `SaveMode.Overwrite` creates a `ReplaceTableAsSelect` plan, the same plan produced by the DataFrameWriterV2 `createOrReplace` API, which is documented as:
   
   ```
          * The output table's schema, partition layout, properties, and other 
configuration
          * will be based on the contents of the data frame and the 
configuration set on this
          * writer. If the table exists, its configuration and data will be 
replaced.
   ```
   
   Therefore, for calls via DataFrameWriter V2 `createOrReplace`, the metadata must always be replaced.
   
   Data sources migrating from V1 to V2 might have interpreted these semantics differently.
   
   In particular, the Delta Lake data source interprets the DataFrameWriter V1 API documentation as not replacing the table schema unless the Delta-specific option `overwriteSchema` is set to `true`. Changing the behaviour to the V2 semantics is unfriendly to users, as it can corrupt tables: an operation that previously overwrote only data would now also overwrite the table's schema, partitioning info and other properties.
   
   Since the created plan is exactly the same, Delta used a very ugly hack to detect which API the call came from, based on the stack trace of the call.
   In Spark 4.1 in Connect mode this stopped working, because planning and execution of the commands are decoupled and the stack trace no longer contains the point where the plan was created.
   To avoid introducing a behaviour change in the Delta data source with Spark 4.1 in Connect mode, Spark provides the new interface `RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption`, which makes DataFrameWriter V1 add an explicit write option indicating to the Delta data source that the call is coming from DataFrameWriter V1.
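
   For example (a hypothetical sketch, not Delta code or part of this PR), a data source could check the new write option in its write path to decide which overwrite semantics to apply:

   ```scala
   import org.apache.spark.sql.util.CaseInsensitiveStringMap

   // True when the write comes from DataFrameWriter V1 saveAsTable with Overwrite.
   def isV1SaveAsTableOverwrite(writeOptions: CaseInsensitiveStringMap): Boolean =
     writeOptions.getBoolean("__isDataFrameWriterV1SaveAsTableOverwrite__", false)

   // if (isV1SaveAsTableOverwrite(writeOptions)) {
   //   // legacy V1 behaviour: replace data only; keep schema/partitioning
   //   // unless a source-specific option such as "overwriteSchema" is set
   // } else {
   //   // createOrReplace semantics: replace metadata and data
   // }
   ```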
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Spark tests added.
   It was also tested locally with the Delta Lake side of the changes. Those changes cannot yet be raised as a Delta Lake PR, because the Delta Lake master branch does not yet cross-compile with Spark 4.1 (WIP).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Claude Code opus-4.5

