juliuszsompolski commented on PR #53173:
URL: https://github.com/apache/spark/pull/53173#issuecomment-3573210572
@dongjoon-hyun
With that, I don't see any way to distinguish DFWV1 `saveAsTable` from
DFWV2 `replace` without a Spark-side change. This PR leaks an abstraction, but
it is scoped as narrowly as possible to prevent a behaviour change in Delta
with Spark 4.1 that would be very user-unfriendly: table metadata would be
overwritten in workloads that did not overwrite it before.
The change is essentially two lines. The first adds the option:
```
extraOptions + ("isDataFrameWriterV1" -> "true")
```
The second applies it only if the source is Delta:
```
if (source == "delta")
```
We could omit this check and always add the option.
Then it wouldn't set a precedent of "have a 3rd-party company data source
in Apache Spark source code" (note, though, that I am not referencing any
company but the open source project Delta, and there are many precedents:
DataFrameWriter itself has multiple references to Hive, Hadoop, Parquet, and
Orc in both its code and its public documentation). But then it would be less
narrowly scoped, piggybacking this option onto other data sources that don't
need or expect it.
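To make the trade-off concrete, here is a rough sketch of the two variants. It uses a plain `Map` as a stand-in for the writer's options (Spark's own map is case-insensitive), and the helper names `narrowOptions` and `alwaysOptions` are hypothetical, not code from the PR:

```scala
object WriterOptionSketch {
  // The option key/value quoted above.
  val V1Marker: (String, String) = "isDataFrameWriterV1" -> "true"

  // Narrow variant: only Delta writes get the marker option.
  // `source` and `extraOptions` stand in for DataFrameWriter's internal state.
  def narrowOptions(source: String, extraOptions: Map[String, String]): Map[String, String] =
    if (source == "delta") extraOptions + V1Marker
    else extraOptions

  // Broad variant: every write issued through DataFrameWriter V1 gets the marker,
  // including sources that neither need nor expect it.
  def alwaysOptions(extraOptions: Map[String, String]): Map[String, String] =
    extraOptions + V1Marker
}
```

With the narrow variant a Parquet write sees its options unchanged, while the broad variant changes the option map for every source.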
Or, to avoid mentioning Delta by name, maybe add a new interface
`interface RequiresDataFrameWriterV1WriteOption extends TableProvider`.
DataFrameWriter V1 would then add this option to all V2 write commands it
creates for such sources (this is more generic than the one specific Overwrite
problem, and it would also match the current Delta behavior, which always
detects DFWV1 from the stack trace).
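A minimal sketch of how such a marker interface could be consumed, under loud assumptions: `TableProvider` here is a local stand-in for Spark's real Java interface, `RequiresDataFrameWriterV1WriteOption` is only the proposed name and does not exist in Spark, and `writeOptions` plus the two provider classes are hypothetical:

```scala
// Stand-in for org.apache.spark.sql.connector.catalog.TableProvider,
// declared locally so the sketch compiles without Spark on the classpath.
trait TableProvider

// Proposed marker interface (hypothetical): it has no methods, only opts a source in.
trait RequiresDataFrameWriterV1WriteOption extends TableProvider

// Hypothetical providers: one opts in, one does not.
class OptInProvider extends RequiresDataFrameWriterV1WriteOption
class PlainProvider extends TableProvider

object MarkerInterfaceSketch {
  // DataFrameWriter V1 would call something like this for every V2 write command
  // it creates, not just for the Overwrite case.
  def writeOptions(provider: TableProvider, extraOptions: Map[String, String]): Map[String, String] =
    provider match {
      case _: RequiresDataFrameWriterV1WriteOption =>
        extraOptions + ("isDataFrameWriterV1" -> "true")
      case _ =>
        extraOptions
    }
}
```

The type check replaces the `source == "delta"` string comparison, so Spark would not name any particular data source.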
The justification for such an interface would be that the documentation of the
public DFW V1 APIs allows for different interpretations, so some existing
data sources may have adopted different behaviors and now need to distinguish
whether a plan actually comes from DataFrameWriter V1 or from another API.
What do you think about adding a new interface like that?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]