juliuszsompolski commented on PR #53173:
URL: https://github.com/apache/spark/pull/53173#issuecomment-3573210572

   @dongjoon-hyun 
   With that, I don't see any way to distinguish DFWV1 `saveAsTable` from 
DFWV2 `replace` without a Spark-side change. This PR leaks an abstraction, but 
it is scoped as narrowly as possible to prevent a behaviour change in Delta 
with Spark 4.1 that would be very user-unfriendly: it would cause table metadata 
to be overwritten in workloads that didn't overwrite it before.
   
   The change is essentially two lines:
   ```
   extraOptions + ("isDataFrameWriterV1" -> "true")
   ```
   This adds the option.
   ```
   if (source == "delta")
   ```
   This applies it only if the source is Delta.
   We could omit this check and always add the option.
   Then it wouldn't set a precedent of "have a 3rd-party company data source 
in Apache Spark source code" (note, though, that I am not referencing any company 
but the open source project Delta, and there are many precedents in DataFrameWriter 
itself, which has multiple references to Hive, Hadoop, Parquet, and ORC in both its 
code and public documentation). But then it would be less narrowly scoped, 
piggybacking this option onto other data sources that don't need or don't expect 
it.
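   
   A minimal sketch of the two variants, assuming a plain `Map[String, String]` 
stands in for the writer's options. `WriterOptionsSketch`, `scopedOptions`, and 
`unscopedOptions` are hypothetical names for illustration, not Spark code:

```scala
// Hypothetical stand-in for the option handling; not the actual DataFrameWriter code.
object WriterOptionsSketch {
  val V1Marker = "isDataFrameWriterV1"

  // Variant in this PR: only the "delta" source receives the marker option.
  def scopedOptions(source: String, extraOptions: Map[String, String]): Map[String, String] =
    if (source == "delta") extraOptions + (V1Marker -> "true") else extraOptions

  // Alternative: every source receives the marker, whether it expects it or not.
  def unscopedOptions(extraOptions: Map[String, String]): Map[String, String] =
    extraOptions + (V1Marker -> "true")
}
```

   With the scoped variant, a source such as `"parquet"` gets its options back 
unchanged; with the unscoped one, every source sees the extra key.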
   
   Or, to avoid mentioning Delta by name, maybe add a new interface `interface 
RequiresDataFrameWriterV1WriteOption extends TableProvider`, which would then 
have this option added to all V2 source write commands created by DataFrameWriter 
V1 (this is more generic than this one specific Overwrite problem, and it would 
also match the current Delta behavior, which detects DFWV1 on the stack trace all 
the time).
   The justification for such an interface would be that the documentation of the 
public APIs of DFW V1 allows for different interpretations, so some existing 
data sources may have adopted different behaviors and now need to distinguish 
whether the plan is actually coming from DataFrameWriter V1 or from another API.
   What do you think about adding a new interface like that?
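   
   To make the proposal concrete, here is a rough sketch of how such a marker 
interface could gate the option. It is written as Scala traits for brevity, and 
`TableProvider` here is a local stand-in so the snippet compiles without Spark; 
`writeOptions`, `OptInProvider`, and `PlainProvider` are hypothetical names:

```scala
object MarkerInterfaceSketch {
  // Local stand-in for Spark's TableProvider; an assumption for this sketch only.
  trait TableProvider

  // Proposed marker: a source opts in to receiving the DFWV1 write option.
  trait RequiresDataFrameWriterV1WriteOption extends TableProvider

  // Hypothetical helper: add the option only for providers that opted in.
  def writeOptions(
      provider: TableProvider,
      extraOptions: Map[String, String]): Map[String, String] =
    provider match {
      case _: RequiresDataFrameWriterV1WriteOption =>
        extraOptions + ("isDataFrameWriterV1" -> "true")
      case _ => extraOptions
    }

  // Hypothetical providers used to exercise the helper.
  object OptInProvider extends RequiresDataFrameWriterV1WriteOption
  object PlainProvider extends TableProvider
}
```

   This way no source is named in Spark, and sources that do not implement the 
marker never see the option.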


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

