[GitHub] [iceberg] rdblue commented on a change in pull request #3764: Spark: Implement copy-on-write UPDATE

GitBox Sun, 19 Dec 2021 12:55:01 -0800


rdblue commented on a change in pull request #3764:
URL: https://github.com/apache/iceberg/pull/3764#discussion_r771999598




##########
File path: spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelCommandDynamicPruning.scala
##########
@@ -86,6 +89,16 @@ case class RowLevelCommandDynamicPruning(spark: SparkSession) extends Rule[Logic
     val matchingRowsPlan = command match {
       case d: DeleteFromIcebergTable =>
         Filter(d.condition.get, relation)
+      case u: UpdateIcebergTable =>
+        // UPDATEs with subqueries may be rewritten using a UNION with two identical scan relations
+        // each scan relation will get its own dynamic filter that will be shared during execution
+        // the analyzer will assign different expr IDs for each scan relation output attributes
+        // that's why the condition may refer to invalid attr expr IDs and must be transformed

Review comment:
   If I understand correctly, a plan is sometimes transformed from this:
   
   ```
   ReplaceData
     Update(id#1 IN (1, 5), data#2 = 'foo')
       V2Relation(db.table, [id#1, data#2])
   ```
   
   To this:
   ```
   ReplaceData
     Update(id#1 IN (1, 5), data#2 = 'foo')
       Union([id#1, data#2])
         V2Relation(db.table, [id#1, data#2])
         V2Relation(db.table, [id#4, data#5])
   ```
   
   So this creates the dynamic filter for each scan separately and remaps the 
attrs in the condition from the update's expr IDs to that relation's expr IDs.
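   
   To make sure I'm reading the remapping correctly, here is a minimal sketch 
of what I think it does. `Attr`, `Ref`, `Lit`, `In`, and `AttrRemap` are 
hypothetical stand-ins that I made up for illustration, not Catalyst classes, 
and the positional pairing is my assumption about how the map is built:
   
   ```scala
   // Hypothetical stand-ins for Catalyst attributes and expressions; not Spark's real classes.
   final case class Attr(name: String, exprId: Int)

   sealed trait Expr
   final case class Ref(attr: Attr) extends Expr
   final case class Lit(value: Any) extends Expr
   final case class In(child: Expr, values: Seq[Expr]) extends Expr

   object AttrRemap {
     // Pair the update's table output with one scan relation's output by position
     // (assumption: both sequences list the same columns in the same order).
     def scanRelationAttrs(tableOutput: Seq[Attr], relationOutput: Seq[Attr]): Map[Int, Attr] =
       tableOutput.zip(relationOutput).map { case (t, r) => t.exprId -> r }.toMap

     // Rewrite every attribute reference in the condition to the scan relation's attribute.
     // References with no entry in the map are left unchanged.
     def rewrite(cond: Expr, attrs: Map[Int, Attr]): Expr = cond match {
       case Ref(a)         => Ref(attrs.getOrElse(a.exprId, a))
       case In(child, vs)  => In(rewrite(child, attrs), vs.map(rewrite(_, attrs)))
       case l: Lit         => l
     }
   }
   ```
   
   With the plan above, rewriting `id#1 IN (1, 5)` against the second relation 
`[id#4, data#5]` should give a condition on `id#4`, which is what that scan's 
dynamic filter needs.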
   
   Is there a test case for this, or is it something that you ran into in 
practice?
   
   It took me a while to understand this (assuming that I do), and I think it 
would be nice if it were clearer what is being replaced. Renaming `attrMap` 
to `scanRelationAttrs` would probably help, so it is clear we're looking up the 
right attr for the scan relation.
   
   It would also be good to add a comment that explains why `u.table.output` 
is always going to match `relation.output`. In fact, I'm not sure it will. 
What happens when a column in the data is not used to produce the output? Is it 
pruned?
   
   For example, `UPDATE t SET id = 1 WHERE true` doesn't need to project `id` 
from the table. If `id` is not projected, then the `attrMap` is incorrect. Is 
there a reason why that can't happen?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



