jbewing commented on PR #14683:
URL: https://github.com/apache/iceberg/pull/14683#issuecomment-3752825752

   Sorry to keep y'all waiting; I've been a bit busy recently. Given that I'd 
been waiting for a review for 6 months, what's 6 days, really? 😅
   
   But @anuragmantri nailed the core piece of this PR. In the Iceberg Spark 
write path, a Spark-based sort order is typically passed through and surfaced 
to the Spark APIs. That Spark-based sort order is derived from a combination 
of the Iceberg table sort order, the write distribution requirements, and the 
partitioning requirements. This makes it hard to go from the Spark ordering 
back to the Iceberg ordering (in fact, sometimes it's impossible).
   
   So what I did here is essentially pass through both orderings: the 
ordering that Iceberg intends in addition to the Spark ordering (see 
SparkWriteUtil). The Spark ordering is passed through as before, and we use 
the Iceberg ordering to set the sort order in manifest entries at write time. 
That's the idea; there's a bit more plumbing involved in some places, e.g. 
rewrite data files jobs can set custom non-table sort orders, and regular 
writes can also set non-table sort orders.
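
   To illustrate the idea (this is not the code in the PR), here is a minimal, 
self-contained Java sketch. Every type and method name in it is hypothetical 
and deliberately avoids the real Iceberg and Spark classes; it also assumes 
that the Spark ordering is built by placing partition columns ahead of the 
sort columns. Under that assumption, two different Iceberg sort orders can 
collapse to the same Spark ordering, which is why the Iceberg ordering has to 
be carried alongside it rather than recovered afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model. None of these types are Iceberg or Spark classes.
final class DualOrderingSketch {

  // Stand-in for an Iceberg sort order: an id plus the columns it sorts by.
  record IcebergOrder(int orderId, List<String> columns) {}

  // Carries both orderings through the write path, instead of only the Spark one.
  record WriteOrdering(List<String> sparkOrdering, IcebergOrder icebergOrder) {}

  // Assumed derivation: partition columns first, then any sort columns
  // that are not already present.
  static List<String> toSparkOrdering(List<String> partitionColumns, IcebergOrder order) {
    List<String> result = new ArrayList<>(partitionColumns);
    for (String column : order.columns()) {
      if (!result.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }

  // Build the pair once, so later stages never need to reverse the derivation.
  static WriteOrdering build(List<String> partitionColumns, IcebergOrder order) {
    return new WriteOrdering(toSparkOrdering(partitionColumns, order), order);
  }

  public static void main(String[] args) {
    List<String> partitionColumns = List.of("day");
    WriteOrdering a = build(partitionColumns, new IcebergOrder(1, List.of("id")));
    WriteOrdering b = build(partitionColumns, new IcebergOrder(2, List.of("day", "id")));

    // Both Iceberg orders produce the Spark ordering [day, id], so the Spark
    // ordering alone cannot tell us which Iceberg order was intended.
    System.out.println(a.sparkOrdering().equals(b.sparkOrdering()));

    // The Iceberg order travels with the Spark ordering, so its id is still
    // available at write time.
    System.out.println(a.icebergOrder().orderId() + " " + b.icebergOrder().orderId());
  }
}
```

   In this sketch the Spark side only ever sees `sparkOrdering()`, while the 
writer would read `icebergOrder()` when recording the sort order for the files 
it produces.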
   
   > Also I would recommend just doing the changes in 1 Spark version for 
review purposes, we can backport to anything applicable afterwards.
   
   Yeah, I did that 6 months ago, and you're welcome to review that PR: 
https://github.com/apache/iceberg/pull/13636 or collapse one of the 3.5 or 4.0 
directories during a review of this PR (they are mirrors of each other, after 
all, since I just applied the patch for 3.5 directly to 4.0). Or I can split it up 
more if you'd like, it's just time on my end and [I’ve split up PRs for easier 
review in the past, but have occasionally found those PRs don’t get feedback. 
If there’s a preference for how to structure these, just let me 
know!](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1765466132290829).
 
   
   In any case, rant aside, I'm willing to do the work if someone is actually 
interested in reviewing and giving me feedback here so that this can get 
integrated into master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

