yadavay-amzn commented on code in PR #55858:
URL: https://github.com/apache/spark/pull/55858#discussion_r3469634240


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##########
@@ -367,7 +367,10 @@ case class ReplaceData(
     originalTable: NamedRelation,
     projections: ReplaceDataProjections,
     groupFilterCondition: Option[Expression] = None,
-    write: Option[Write] = None) extends RowLevelWrite {
+    write: Option[Write] = None)
+    extends RowLevelWrite with SupportsNonDeterministicExpression {
+
+  override def allowNonDeterministicExpression: Boolean = operation.command() 
== MERGE

Review Comment:
   Thanks for catching this; you were right that it was unsafe. I reproduced 
the silent lost update (a non-deterministic source whose key set differs 
between the two scans makes the pruning subquery drop a target group the main 
merge matches, losing the update) and pushed a fix.
   
   Rather than materialize/checkpoint the source, I went at the root of the 
hazard: skip runtime group filtering when its condition is non-deterministic - 
a `cond.deterministic` gate in `canInjectGroupFilters`, since the 
`groupFilterCondition` is an `Exists` over the source and is non-deterministic 
exactly when the source is. That removes the double-scan entirely, and being a 
single centralized gate it covers both the ReplaceData and WriteDelta paths. 
Deterministic-source MERGE still gets the group filter, so the optimization is 
unaffected for the common case. I did evaluate materialization first, but it 
turned out to depend on exchange/subquery reuse, which deliberately doesn't 
dedup non-deterministic plans - so correctness would hinge on an optimization 
that can be disabled; the guard instead follows Spark's existing stance of not 
optimizing non-deterministic plans (e.g. InlineCTE).
   
   Kept the `SupportsNonDeterministicExpression` relaxation so the analysis 
false-positive is fixed. Tests: an end-to-end suite over 
`InMemoryRowLevelOperationTable` that (a) reproduces the original 
`INVALID_NON_DETERMINISTIC_EXPRESSIONS` failure now passing and (b) runs with 
the runtime group filter enabled and asserts correct rows (fails without the 
guard), plus a deterministic-source test confirming the filter is still 
injected, and UPDATE/DELETE negative cases. Also folded the parens nit and 
updated the PR description.
   
   Does this approach look okay to you?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to