jbewing commented on PR #14683: URL: https://github.com/apache/iceberg/pull/14683#issuecomment-3752825752
Sorry to keep y'all waiting, I've been a bit busy recently. Given that I'd been waiting for a review for 6 months, what's 6 days really 😅. But @anuragmantri nailed the core piece of this PR.

In the Iceberg Spark write path, typically only a Spark-based sort order was passed through and surfaced to the Spark APIs. This Spark-based sort order is typically derived from the Iceberg table sort order, the write distribution requirements, and the partitioning requirements combined. That makes it hard to go from the Spark ordering back to the Iceberg ordering (in fact, sometimes it's impossible). So what I did here is essentially pass through both orderings: the ordering that Iceberg intends in addition to the Spark ordering (see SparkWriteUtil). The Spark ordering is passed through as it was, and we use the Iceberg ordering to set the sort order on manifest entries at write time. That's the idea. There's a bit more plumbing involved in some places, e.g. rewrite data files jobs can set custom non-table sort orders, and regular writes can also set non-table sort orders.

> Also I would recommend just doing the changes in 1 Spark version for review purposes, we can backport to anything applicable afterwards.

Yeah, I did that 6 months ago and you're welcome to review that PR: https://github.com/apache/iceberg/pull/13636, or to collapse one of the 3.5 or 4.0 directories during a review of this PR (they are mirrors of each other, after all, as I just applied the patch for 3.5 directly to 4.0). Or I can split it up more if you'd like. It's just time on my end, and [I’ve split up PRs for easier review in the past, but have occasionally found those PRs don’t get feedback. If there’s a preference for how to structure these, just let me know!](https://apache-iceberg.slack.com/archives/C03LG1D563F/p1765466132290829)

In any case, rant aside, I'm willing to do the work if someone is actually interested in reviewing & giving me feedback here so that this can get integrated into master.
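
A minimal sketch of the dual-ordering idea described above. Every name here (`IcebergOrder`, `WriteRequirements`, `build`, `sortOrderIdForManifest`) is a hypothetical stand-in rather than the PR's actual classes or any Iceberg/Spark API, and the way the Spark ordering is derived is a deliberately simplified assumption. It only illustrates why the Spark ordering alone can be ambiguous and why carrying the Iceberg ordering alongside it resolves that:

```java
import java.util.ArrayList;
import java.util.List;

public class DualOrderingSketch {

    // Hypothetical stand-in for an Iceberg sort order: an id plus its sort columns.
    static final class IcebergOrder {
        final int orderId;
        final List<String> columns;

        IcebergOrder(int orderId, List<String> columns) {
            this.orderId = orderId;
            this.columns = columns;
        }
    }

    // Hypothetical stand-in for what the write path carries: both orderings side by side.
    static final class WriteRequirements {
        final List<String> sparkOrdering;
        final IcebergOrder icebergOrder;

        WriteRequirements(List<String> sparkOrdering, IcebergOrder icebergOrder) {
            this.sparkOrdering = sparkOrdering;
            this.icebergOrder = icebergOrder;
        }
    }

    // Assumed, simplified derivation: partition columns first, then any sort
    // columns that are not already present. The real derivation is more involved.
    static WriteRequirements build(List<String> partitionColumns, IcebergOrder order) {
        List<String> spark = new ArrayList<>(partitionColumns);
        for (String column : order.columns) {
            if (!spark.contains(column)) {
                spark.add(column);
            }
        }
        return new WriteRequirements(spark, order);
    }

    // The manifest-side value is read from the Iceberg ordering, never
    // reverse-engineered from the Spark ordering.
    static int sortOrderIdForManifest(WriteRequirements reqs) {
        return reqs.icebergOrder.orderId;
    }

    public static void main(String[] args) {
        WriteRequirements a = build(List.of("day"), new IcebergOrder(1, List.of("id")));
        WriteRequirements b = build(List.of("day"), new IcebergOrder(2, List.of("day", "id")));
        // Two different Iceberg orders collapse to the same Spark ordering.
        System.out.println(a.sparkOrdering.equals(b.sparkOrdering)); // prints "true"
        System.out.println(sortOrderIdForManifest(a) + " vs " + sortOrderIdForManifest(b)); // prints "1 vs 2"
    }
}
```

In this toy setup, two distinct Iceberg orders yield an identical Spark ordering, so the sort order id cannot be recovered from the Spark side; keeping the Iceberg order in the requirements object makes it directly available at write time.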
