yadavay-amzn commented on code in PR #55858:
URL: https://github.com/apache/spark/pull/55858#discussion_r3469634240


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##########
@@ -367,7 +367,10 @@ case class ReplaceData(
     originalTable: NamedRelation,
     projections: ReplaceDataProjections,
     groupFilterCondition: Option[Expression] = None,
-    write: Option[Write] = None) extends RowLevelWrite {
+    write: Option[Write] = None)
+    extends RowLevelWrite with SupportsNonDeterministicExpression {
+
+  override def allowNonDeterministicExpression: Boolean = operation.command() 
== MERGE

Review Comment:
   Thanks for catching this; you were right that it was unsafe. I reproduced 
the silent lost update (a non-deterministic source whose key set differs 
between the two scans makes the pruning subquery drop a target group the main 
merge matches, losing the update) and pushed a fix.
   
   Rather than materialize/checkpoint the source, I went at the root of the 
hazard: skip runtime group filtering when its condition is non-deterministic - 
a `cond.deterministic` gate in `canInjectGroupFilters`, since the 
`groupFilterCondition` is an `Exists` over the source and is non-deterministic 
exactly when the source is. That removes the double-scan entirely, and being a 
single centralized gate it covers both the ReplaceData and WriteDelta paths. 
Deterministic-source MERGE still gets the group filter, so the optimization is 
unaffected for the common case. 
   
   I did evaluate materialization first, but it turned out to depend on 
exchange/subquery reuse, which deliberately doesn't dedup non-deterministic 
plans - so correctness would hinge on an optimization that can be disabled; the 
guard instead follows Spark's existing stance of not optimizing 
non-deterministic plans (e.g. InlineCTE).
   
   Kept the `SupportsNonDeterministicExpression` relaxation so the analysis 
false-positive is fixed. 
   
   Tests: an end-to-end suite over `InMemoryRowLevelOperationTable` that (a) 
reproduces the original `INVALID_NON_DETERMINISTIC_EXPRESSIONS` failure now 
passing and (b) runs with the runtime group filter enabled and asserts correct 
rows (fails without the guard), plus a deterministic-source test confirming the 
filter is still injected, and UPDATE/DELETE negative cases. Also folded the 
parens nit and updated the PR description.
   
   Does this approach look okay to you?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to