juliuszsompolski commented on PR #53173: URL: https://github.com/apache/spark/pull/53173#issuecomment-3570073315
> 1. Which Apache Spark preview version and Delta version did you test this? Specifically, which preview did this start to happen?

I have been testing it with Delta master built against Spark master. The behaviour change was caused by https://github.com/apache/spark/commit/27aba95bdf762b6d1730ed890698cf14d9c4585f, which is present in 4.1. It separates the planning and execution of Command in Spark Connect. The [very ugly hack](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CreateDeltaTableLike.scala#L174-L177) in Delta depended on the execution of the command happening on the same call trace as the API call. Now only logical planning happens within that call trace; the plan gets returned to Spark Connect, and execution happens later from a different call trace.

> 2. Why don't we implement this in `io.delta.sql.DeltaSparkSessionExtension` or document it instead of changing Apache Spark source code?

It would make me very happy to find a way to fix it without changing Spark code. However, it seems to me that I currently have no way to distinguish DFWV1 `saveAsTable` with mode overwrite from DFWV2 `replace`, because both operations create an identical logical plan. So maybe I could move the check to DeltaAnalysis and have DeltaAnalysis do the same kind of stack trace inspection... but I would really want to get rid of a hack that depends on the stack trace, because that is bound to blow up eventually; this one was a bomb that had been ticking for 6 years...

> 3. Do you think we can have a test coverage with a dummy data source?

I could add a test with a dummy data source that verifies that this option is added for `saveAsTable`, but the e2e behavior change is Delta-specific.

> 4. If this is an emergency fix, what would be the non-emergency fix?
> > This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark

See my other comment https://github.com/apache/spark/pull/53173#discussion_r2555560154 for some of the options I am exploring.
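For context, the "stack trace hack" discussed above boils down to inspecting the current thread's call stack to infer which API entry point invoked the command. A minimal, hypothetical sketch of the pattern (names are illustrative, not Delta's actual code) shows why it breaks once planning and execution run on different call traces:

```scala
// Hypothetical sketch of a stack-trace-based caller check, the pattern the
// Delta hack relies on. If command execution moves to a different call trace
// (as with separated planning/execution in Spark Connect), the expected
// frame is no longer on the stack and the check silently changes behavior.
object StackTraceCheck {
  // True if any frame on the current call stack mentions the given class-name fragment.
  def calledFrom(markerClassFragment: String): Boolean =
    Thread.currentThread().getStackTrace.exists(_.getClassName.contains(markerClassFragment))

  def main(args: Array[String]): Unit = {
    // We are inside StackTraceCheck, so its frames are on the stack.
    println(calledFrom("StackTraceCheck"))
    // A caller that is not on this call trace is not detected.
    println(calledFrom("DataFrameWriterV1EntryPoint"))
  }
}
```

The fragility is exactly the one described above: the check only works while the interesting caller and the command's execution share a call trace, which is an implementation detail rather than a contract.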
