[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3661: Spark: Implement copy-on-write DELETE

GitBox Mon, 13 Dec 2021 09:00:24 -0800


aokolnychyi commented on a change in pull request #3661:
URL: https://github.com/apache/iceberg/pull/3661#discussion_r767948148




##########
File path: 
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ExtendedV2Writes.scala
##########
@@ -90,6 +93,14 @@ object ExtendedV2Writes extends Rule[LogicalPlan] with 
PredicateHelper {
       }
       val newQuery = ExtendedDistributionAndOrderingUtils.prepareQuery(write, 
query, conf)
       o.copy(write = Some(write), query = newQuery)
+
+    case rd @ ReplaceData(r: DataSourceV2Relation, query, _, None) =>
+      val rowSchema = StructType.fromAttributes(rd.dataInput)
+      val writeBuilder = newWriteBuilder(r.table, rowSchema, Map.empty)
+      val write = writeBuilder.build()
+      // TODO: detect when query contains a shuffle and insert a round-robin 
repartitioning

Review comment:
       I've spent some time thinking and it does not look like there is a 
guarantee an extra round-robin repartition would improve the performance. In 
some cases, it may even make things worse. Here are my thoughts.
   
   - With round-robin, we won’t have to evaluate a condition or join predicate 
twice. In MERGE, we won’t have to evaluate the join condition and won’t have to 
merge the rows. (GOOD)
   - With round-robin, we won't read the data twice from the storage. If 
needed, we fetch shuffle data. (GOOD)
   - With round-robin, we will have an extra shuffle (both write and read). 
Shuffles are expensive and tend to behave poorly at scale. Also, the data will 
be scattered all over the place so each reduce task will need to read from most 
nodes. (BAD)
   - With round-robin, we have extra complexity that may not be always 
beneficial for the query performance. In some cases, this may degrade the 
performance or use more resources. (BAD)
   
   At this point, I am inclined to skip inserting an extra round-robin 
repartition as even merging the rows should not be that bad considering that we 
operate on `InternalRow`. I've spent so much time tuning shuffles recently so 
that I am a bit skeptical it will help here.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3661: Spark: Implement copy-on-write DELETE

Reply via email to