juliuszsompolski commented on PR #53173:
URL: https://github.com/apache/spark/pull/53173#issuecomment-3570073315

   > 1. Which Apache Spark preview version and Delta version did you test this? 
Specifically, which preview did this start to happen?
   
   I have been testing this with Delta master built against Spark master.
   The behaviour change was caused by https://github.com/apache/spark/commit/27aba95bdf762b6d1730ed890698cf14d9c4585f, which is present in 4.1 and separates the planning and execution of commands in Spark Connect. The [very ugly hack](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CreateDeltaTableLike.scala#L174-L177) in Delta depended on the command being executed on the same call trace as the API call. Now only logical planning happens within that call trace; the plan is returned to Spark Connect, and execution happens later from a different call trace.
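   For illustration, a minimal sketch of the general pattern such a stack trace hack follows; the object and method names here are hypothetical, not the actual Delta code:
   ```scala
   // Hypothetical sketch of a stack-trace-based check (not the actual Delta code).
   // It only works while command execution happens on the same call trace as the
   // DataFrameWriter API call that triggered it.
   object CallSiteHack {
     // True if the current thread's stack contains a DataFrameWriter.saveAsTable
     // frame. Once planning and execution are separated in Spark Connect, the
     // command runs later on a different call trace, so this returns false.
     def invokedFromSaveAsTable(): Boolean =
       Thread.currentThread().getStackTrace.exists { frame =>
         frame.getClassName.endsWith("DataFrameWriter") &&
           frame.getMethodName == "saveAsTable"
       }
   }
   ```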
   
   > 2. Why don't we implement this in 
`io.delta.sql.DeltaSparkSessionExtension` or document it instead of changing 
Apache Spark source code?
   
   It would make me very happy to find a way to fix it without changing Spark code. It seems to me, however, that I currently lose any ability to distinguish DFWV1 `saveAsTable` with mode overwrite from DFWV2 `replace`, because both operations create an identical logical plan.
   So maybe I could move the hack into DeltaAnalysis and have it do the same kind of stack trace inspection there... but I would really like to get rid of any hack that depends on the stack trace, because that is bound to blow up eventually; this one was a bomb that had been ticking for 6 years...
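   Concretely, these are the two call paths I mean (a sketch; the table name and data are just placeholders):
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().getOrCreate()
   val df = spark.range(10).toDF("id")

   // DataFrameWriter (V1): saveAsTable with overwrite mode.
   df.write.format("delta").mode("overwrite").saveAsTable("tbl")

   // DataFrameWriterV2: explicit replace of the table.
   df.writeTo("tbl").using("delta").replace()
   ```
   Both end up as the same logical plan on the server, so the data source cannot tell them apart.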
   
   > 3. Do you think we can have a test coverage with a dummy data source?
   
   I could add a test with a dummy data source that verifies that this option is added for `saveAsTable` (see the sketch below), but the end-to-end behaviour change is Delta-specific.
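   One possible shape for such a test: a dummy DSv1 `CreatableRelationProvider` that records the options it receives, asserted on after a `saveAsTable` call. This is only a sketch; the option key in the assertion is a placeholder, not necessarily the one this PR adds:
   ```scala
   import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
   import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister}
   import org.apache.spark.sql.types.StructType

   // Dummy data source that records the options passed to createRelation.
   class OptionCapturingSource extends CreatableRelationProvider with DataSourceRegister {
     override def shortName(): String = "optionCapture"

     override def createRelation(
         ctx: SQLContext,
         mode: SaveMode,
         parameters: Map[String, String],
         data: DataFrame): BaseRelation = {
       OptionCapturingSource.lastOptions = parameters
       new BaseRelation {
         override def sqlContext: SQLContext = ctx
         override def schema: StructType = data.schema
       }
     }
   }

   object OptionCapturingSource {
     @volatile var lastOptions: Map[String, String] = Map.empty
   }

   // In a test body (e.g. extending SharedSparkSession):
   //   spark.range(3).write
   //     .format(classOf[OptionCapturingSource].getName)
   //     .mode("overwrite")
   //     .saveAsTable("t")
   //   assert(OptionCapturingSource.lastOptions.contains("placeholderOptionKey"))
   ```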
   
   > 4. If this is an emergency fix, what would be the non-emergency fix?
   >    > This is an emergency fix to prevent a breaking change resulting in 
data corruption with Delta data sources in Spark
   
   See my other comment 
https://github.com/apache/spark/pull/53173#discussion_r2555560154 for some of 
the options I am exploring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

