jbewing commented on code in PR #15150:
URL: https://github.com/apache/iceberg/pull/15150#discussion_r2757050707
##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkWriteRequirements.java:
##########
@@ -26,18 +26,32 @@
/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {
+ public static final long NO_ADVISORY_PARTITION_SIZE = 0;
public static final SparkWriteRequirements EMPTY =
- new SparkWriteRequirements(Distributions.unspecified(), new SortOrder[0], 0);
+ new SparkWriteRequirements(
+ Distributions.unspecified(),
+ new SortOrder[0],
+ org.apache.iceberg.SortOrder.unsorted(),
+ NO_ADVISORY_PARTITION_SIZE);
private final Distribution distribution;
private final SortOrder[] ordering;
+ private final org.apache.iceberg.SortOrder icebergOrdering;
private final long advisoryPartitionSize;
SparkWriteRequirements(
Review Comment:
So you could probably get away with just passing the id all the way down, and
that is _effectively_ what is happening here.
We just end up unwrapping the id into an Iceberg Sort Order because that
makes the code a bit more expressive & readable in some places, IMO.
`SparkWriteRequirements` is a nice example: having the Iceberg sort order
available makes it _really_ easy to express how the Spark execution sort order
should behave when the Spark ordering doesn't necessarily match the Iceberg
ordering (for example, when an additional prefix is thrown in because we're
using a range write distribution).
I'm happy to unwind this if you don't think that's the case and the other
way is more expressive. In my many iterations of solving this problem
"cleanly", I found that keeping the Sort Orders together (despite the fully
qualified class name terribleness) shows the relationship between the two
nicely & keeps things concise.
Not passing the Iceberg ordering through is a bit more concise, but it is
substantially more brittle and prone to breakage, and correctness felt more
important than being concise. Passing an Iceberg Sort Order id down instead
just leads to it being unwrapped in quite a few places, plus a ton of
`if (id == 0 / UNSORTED) {}` checks.
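To make the trade-off concrete, here is a rough sketch of the two shapes. It is not the code in this PR: `IcebergOrder`, `Requirements`, `endsWithIcebergOrder`, and `fieldsForId` are made-up stand-ins, sort fields are reduced to plain strings, and the "Spark ordering ends with the Iceberg ordering" check is only my assumption about what the prefix case might look like.

```java
import java.util.List;
import java.util.Map;

public class SortOrderPassingSketch {

  /** Hypothetical stand-in for org.apache.iceberg.SortOrder (not the real class). */
  public record IcebergOrder(int orderId, List<String> fields) {
    // Order id 0 is reserved for the unsorted order in the Iceberg spec.
    public static final IcebergOrder UNSORTED = new IcebergOrder(0, List.of());
  }

  /** Hypothetical stand-in for SparkWriteRequirements carrying both orderings. */
  public record Requirements(List<String> sparkOrdering, IcebergOrder icebergOrdering) {
    // With the whole order in hand, the relationship between the two orderings is a
    // single expression: extra Spark fields may only show up as a prefix.
    public boolean endsWithIcebergOrder() {
      int offset = sparkOrdering.size() - icebergOrdering.fields().size();
      return offset >= 0
          && sparkOrdering.subList(offset, sparkOrdering.size()).equals(icebergOrdering.fields());
    }
  }

  /** Id-only alternative: each consumer has to resolve the id and special-case 0. */
  public static List<String> fieldsForId(int id, Map<Integer, IcebergOrder> ordersById) {
    if (id == 0) {
      return List.of(); // the UNSORTED check that ends up repeated at every call site
    }
    IcebergOrder order = ordersById.get(id);
    if (order == null) {
      throw new IllegalArgumentException("Unknown sort order id: " + id);
    }
    return order.fields();
  }

  public static void main(String[] args) {
    IcebergOrder byTs = new IcebergOrder(1, List.of("ts"));
    Requirements reqs = new Requirements(List.of("bucket", "ts"), byTs);
    System.out.println(reqs.endsWithIcebergOrder());
    System.out.println(fieldsForId(1, Map.of(1, byTs)));
  }
}
```

In the id-only version, every caller also needs access to the id-to-order lookup, which is the unwrapping I'd like to avoid repeating.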
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]