aokolnychyi commented on code in PR #7637:
URL: https://github.com/apache/iceberg/pull/7637#discussion_r1199411854


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/TestSparkDistributionAndOrderingUtil.java:
##########
@@ -346,17 +490,21 @@ public void testRangeWritePartitionedSortedTable() {
   //
   // PARTITIONED BY date, days(ts) UNORDERED
   // -------------------------------------------------------------------------
-  // delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY date, days(ts), _file, _pos
+  // delete mode is NOT SET -> CLUSTER BY date, days(ts) + LOCALLY ORDER BY date, days(ts)
+  // delete mode is NOT SET (fanout) -> CLUSTER BY date, days(ts) + empty ordering
   // delete mode is NONE -> unspecified distribution + LOCALLY ORDERED BY date, days(ts)
-  // delete mode is HASH -> CLUSTER BY _file + LOCALLY ORDER BY date, days(ts), _file, _pos
-  // delete mode is RANGE -> ORDER BY date, days(ts), _file, _pos
+  // delete mode is NONE (fanout) -> unspecified distribution + empty ordering
+  // delete mode is HASH -> CLUSTER BY date, days(ts) + LOCALLY ORDER BY date, days(ts)
+  // delete mode is HASH (fanout) -> CLUSTER BY date, days(ts) + empty ordering
+  // delete mode is RANGE -> ORDER BY date, days(ts)
+  // delete mode is RANGE (fanout) -> RANGE DISTRIBUTE BY date, days(ts) + empty ordering
   //
   // PARTITIONED BY date ORDERED BY id
   // -------------------------------------------------------------------------
-  // delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY date, id

Review Comment:
   I am ditching clustering by `_file` in favor of clustering by partition columns for CoW operations to reduce the number of produced files. Right now, each output task may get records from multiple files and partitions, so we produce more files than needed. Clustering by `_file` was originally done to avoid OOM exceptions with very large partitions, but that is no longer a problem with AQE-planned writes in Spark 3.4.
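   To make the new decision table easier to follow, here is a hypothetical sketch (not Iceberg's actual `SparkDistributionAndOrderingUtil` code) that models the CoW delete requirements from the comments above, for a table `PARTITIONED BY date, days(ts)` with no table-level sort order. The class and method names are made up for illustration:

```java
// Hypothetical model of the CoW delete distribution/ordering table.
// Not Iceberg's real implementation; strings mirror the test comments.
public class CowDeleteRequirements {

  /**
   * Returns "distribution + ordering" for a given delete distribution mode
   * ("NOT SET", "NONE", "HASH", "RANGE") and whether fanout writers are on.
   */
  public static String requirementsFor(String deleteMode, boolean fanout) {
    // Fanout writers keep multiple files open, so no local sort is needed;
    // otherwise sort locally by partition columns so each task writes
    // sequentially, one partition at a time.
    String localOrdering =
        fanout ? "empty ordering" : "LOCALLY ORDER BY date, days(ts)";
    switch (deleteMode) {
      case "NONE":
        return "unspecified distribution + " + localOrdering;
      case "RANGE":
        // Without fanout, RANGE implies a global sort that also covers
        // the local ordering, so no separate local sort is listed.
        return fanout
            ? "RANGE DISTRIBUTE BY date, days(ts) + empty ordering"
            : "ORDER BY date, days(ts)";
      default:
        // NOT SET and HASH both cluster by the partition columns now,
        // instead of by _file as before this change.
        return "CLUSTER BY date, days(ts) + " + localOrdering;
    }
  }
}
```

   The key point the sketch captures is that every branch clusters or sorts by partition columns only; `_file` and `_pos` no longer appear anywhere in the CoW requirements.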



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

