juliuszsompolski opened a new pull request, #53215:
URL: https://github.com/apache/spark/pull/53215
### What changes were proposed in this pull request?
This PR adds a new marker interface
RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption that data sources can
implement to distinguish between DataFrameWriter V1 saveAsTable with
SaveMode.Overwrite and DataFrameWriter V2 createOrReplace/replace operations.
* Added RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption interface
in the catalog package
* Modified DataFrameWriter.saveAsTable to add an internal write option
(`__isDataFrameWriterV1SaveAsTableOverwrite__=true`) when the provider
implements this interface and the mode is SaveMode.Overwrite (see the sketch below)
* Added tests verifying the option is only added for providers that opt-in
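
A minimal sketch of the shape of the change, using the names from the description above; the real trait lives in Spark's catalog package and the real wiring sits inside DataFrameWriter.saveAsTable, so the helper below is only an illustration.
```scala
import org.apache.spark.sql.SaveMode

// Marker interface: a provider mixes this in to opt in to the new write option.
trait RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption

object SaveAsTableOptionSketch {
  // Hypothetical helper mirroring what saveAsTable would do with its options.
  def withV1OverwriteMarker(
      provider: AnyRef,
      mode: SaveMode,
      options: Map[String, String]): Map[String, String] = {
    val optedIn =
      provider.isInstanceOf[RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption]
    if (optedIn && mode == SaveMode.Overwrite) {
      // Internal marker option read back by the data source at write time.
      options + ("__isDataFrameWriterV1SaveAsTableOverwrite__" -> "true")
    } else {
      options
    }
  }
}
```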
### Why are the changes needed?
Spark's SaveMode.Overwrite is documented as:
```
* if data/table already exists, existing data is expected to be overwritten
* by the contents of the DataFrame.
```
It does not define whether the table's metadata (schema, partitioning,
properties) is also overwritten.
However, DataFrameWriter V1 saveAsTable with SaveMode.Overwrite creates a
ReplaceTableAsSelect plan, the same plan produced by the DataFrameWriterV2
createOrReplace API, which is documented as:
```
* The output table's schema, partition layout, properties, and other configuration
* will be based on the contents of the data frame and the configuration set on this
* writer. If the table exists, its configuration and data will be replaced.
```
Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata
always needs to be replaced.
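For context, the two user-facing calls that currently end up with the same ReplaceTableAsSelect plan look roughly like this (the `df` value, table name and format are illustrative):
```scala
import org.apache.spark.sql.SaveMode

// DataFrameWriter V1: documented as overwriting the existing *data*.
df.write.format("delta").mode(SaveMode.Overwrite).saveAsTable("my_table")

// DataFrameWriterV2: documented as replacing schema, partitioning, properties and data.
df.writeTo("my_table").using("delta").createOrReplace()
```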
Data sources migrating from V1 to V2 might have interpreted this differently.
In particular, the Delta Lake data source interprets the DataFrameWriter V1 API
documentation as not replacing the table schema, unless the Delta-specific
option "overwriteSchema" is set to true. Changing this behaviour to the V2
semantics is unfriendly to users and can corrupt tables: an operation that
previously overwrote only the data would now also overwrite the table's schema,
partitioning info and other properties.
Since the created plan is exactly the same, Delta used an ugly hack to detect
which API the call came from by inspecting the call's stack trace. In Spark 4.1
in Connect mode this stopped working, because planning and execution of
commands are decoupled and the stack trace no longer contains the point where
the plan was created.
To avoid introducing a behaviour change in the Delta data source with Spark 4.1
in Connect mode, Spark provides the new
RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption interface, which makes
DataFrameWriter V1 add an explicit write option indicating to the data source
that the call comes from DataFrameWriter V1 saveAsTable with SaveMode.Overwrite.
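A hypothetical data-source-side sketch, reusing the marker trait from the sketch above: the provider mixes in the trait to opt in, and its write path reads the injected option to decide whether to keep the legacy V1 behaviour or apply V2 replace semantics. All names other than TableProvider and the option keys are illustrative.
```scala
import org.apache.spark.sql.connector.catalog.TableProvider

// Opting in: only providers that mix in the marker trait get the option injected.
abstract class DeltaLikeProvider
  extends TableProvider
  with RequiresDataFrameWriterV1SaveAsTableOverwriteWriteOption

object DeltaLikeWritePathSketch {
  // Hypothetical helper in the data source's write path.
  def shouldReplaceTableMetadata(writeOptions: Map[String, String]): Boolean = {
    val fromV1SaveAsTableOverwrite =
      writeOptions.get("__isDataFrameWriterV1SaveAsTableOverwrite__").contains("true")
    // V2 createOrReplace (no marker option) replaces metadata; V1 saveAsTable keeps
    // the old behaviour unless the source-specific "overwriteSchema" option is set.
    !fromV1SaveAsTableOverwrite || writeOptions.get("overwriteSchema").contains("true")
  }
}
```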
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Spark tests added.
It was also tested locally with the Delta Lake side of the changes. That change
cannot yet be raised as a Delta Lake PR, because the Delta Lake master branch
does not yet cross-compile with Spark 4.1 (WIP).
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code opus-4.5