jbewing commented on code in PR #15150:
URL: https://github.com/apache/iceberg/pull/15150#discussion_r2757027071
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java:
##########
@@ -54,6 +54,7 @@ private SparkWriteOptions() {}
public static final String REWRITTEN_FILE_SCAN_TASK_SET_ID =
"rewritten-file-scan-task-set-id";
public static final String OUTPUT_SPEC_ID = "output-spec-id";
+ public static final String OUTPUT_SORT_ORDER_ID = "output-sort-order-id";
Review Comment:
Absolutely, but unfortunately there are a few core write paths that don't
necessarily "mirror" the table sort order all the time, so sometimes it needs
to be customized. Those are:
1. Rewrite Data Files (changes can be seen above in
`SparkShufflingFileRewriteRunner`). TL;DR: the default is to use the table
sort order, but an operator can pass in any sort order they'd like to this job
and it _will_ comply.
2. My memory is a bit fuzzier at this point, but I believe there are a few
write-time options you can pass to essentially ignore table ordering. The ones
that come to mind are "fanout-enabled" and "write-distribution=none" (in these
cases the Spark sort order will match the Iceberg sort order, though).
`SparkWriteUtil` handles a lot of the heavy lifting here as far as how these
interact with Spark <-> Iceberg sort orders, but in a lot of modes it's
actually not possible to directly extract the Iceberg sort order intent from
the Spark execution plan, as Spark will sometimes prepend partition keys to the
sort order keys in certain operations.
It's worth taking a look around `SparkWriteUtil`, as I said, but the exciting
thing I learned on this PR adventure is: Iceberg table sort order is not
necessarily equal to Iceberg operation sort order, which is also not
necessarily equal to Spark sort order.
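
To make the three-way distinction concrete, here is a rough sketch in plain
Java. It does not use any Iceberg or Spark APIs: `SortOrderSketch`,
`operationSortOrderId`, and `sparkOrdering` are hypothetical names made up for
illustration, and only the option key is taken from the constant added in this
diff. The precedence (an explicit option wins over the table default) and the
simple "partition keys first" concatenation are assumptions based on the
description above, not the actual logic in `SparkWriteUtil`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SortOrderSketch {
  // Option key mirroring the constant added in this diff.
  static final String OUTPUT_SORT_ORDER_ID = "output-sort-order-id";

  // Operation sort order: an explicit write option (assumed to take
  // precedence) or, if absent, the table's default sort order ID.
  static int operationSortOrderId(Map<String, String> writeOptions, int tableDefaultId) {
    return Optional.ofNullable(writeOptions.get(OUTPUT_SORT_ORDER_ID))
        .map(Integer::parseInt)
        .orElse(tableDefaultId);
  }

  // Spark ordering (simplified assumption): partition keys are prepended to
  // the sort keys, so the original sort keys can no longer be read back
  // from the ordering alone.
  static List<String> sparkOrdering(List<String> partitionKeys, List<String> sortKeys) {
    List<String> ordering = new ArrayList<>(partitionKeys);
    ordering.addAll(sortKeys);
    return ordering;
  }

  public static void main(String[] args) {
    // No override: the operation falls back to the table sort order.
    System.out.println(operationSortOrderId(Map.of(), 1));
    // Override: the operation sort order differs from the table's.
    System.out.println(operationSortOrderId(Map.of(OUTPUT_SORT_ORDER_ID, "2"), 1));
    // The Spark ordering carries an extra leading partition key.
    System.out.println(sparkOrdering(List.of("day"), List.of("user_id")));
  }
}
```

The point of the sketch is that the operation's sort order ID has to be
carried explicitly (here, through the write option) because it can't be
reliably reconstructed from the ordering Spark ends up using.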
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]