aokolnychyi commented on a change in pull request #4047:
URL: https://github.com/apache/iceberg/pull/4047#discussion_r806263344



##########
File path: 
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/TestSparkDistributionAndOrderingUtil.java
##########
@@ -1567,6 +1567,370 @@ public void testRangePositionDeltaUpdatePartitionedTable() {
         table, UPDATE, expectedDistribution, SPEC_ID_PARTITION_FILE_POSITION_ORDERING);
   }
 
+  // 
==================================================================================
+  // Distribution and ordering for merge-on-read MERGE operations with 
position deletes
+  // 
==================================================================================
+  //
+  // UNPARTITIONED UNORDERED
+  // -------------------------------------------------------------------------
+  // merge mode is NOT SET -> rely on write distribution and ordering as a 
basis
+  // merge mode is NONE -> unspecified distribution + LOCALLY ORDER BY 
_spec_id, _partition, _file, _pos
+  // merge mode is HASH -> unspecified distribution + LOCALLY ORDER BY 
_spec_id, _partition, _file, _pos
+  // merge mode is RANGE -> unspecified distribution + LOCALLY ORDER BY 
_spec_id, _partition, _file, _pos
+  //
+  // UNPARTITIONED ORDERED BY id, data
+  // -------------------------------------------------------------------------
+  // merge mode is NOT SET -> rely on write distribution and ordering as a 
basis
+  // merge mode is NONE -> unspecified distribution +
+  //                       LOCALLY ORDER BY _spec_id, _partition, _file, _pos, 
id, data
+  // merge mode is HASH -> unspecified distribution +

Review comment:
       Well, I am not sure. I like that our merge and write logic are consistent right now. My hope was that AQE would coalesce tasks as needed to avoid a huge number of small writing tasks (and hence a huge number of delete files). I think AQE should behave better than a round-robin distribution. This case is about unpartitioned tables, so we will most likely produce a single delete file per writing task (which shouldn't be that bad). As long as we don't have a huge number of writing tasks, we should be fine, I guess?
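
       As a side note, the AQE coalescing behavior I'm relying on here is controlled by Spark session configs. A minimal sketch (the keys below are real Spark SQL options, but the chosen size value is just an illustrative assumption, not something this PR sets):

       ```properties
       # Enable adaptive query execution so Spark can re-plan at runtime
       spark.sql.adaptive.enabled=true
       # Let AQE merge small shuffle partitions into fewer, larger writing tasks
       spark.sql.adaptive.coalescePartitions.enabled=true
       # Target partition size AQE coalesces toward (illustrative value)
       spark.sql.adaptive.advisoryPartitionSizeInBytes=64m
       ```

       With coalescing on, a shuffle that would otherwise fan out into many tiny writing tasks gets merged toward the advisory size, which is why the number of delete files should stay bounded.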




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


