jbewing commented on PR #15150: URL: https://github.com/apache/iceberg/pull/15150#issuecomment-4035625076
Sorry for the delay here @RussellSpitzer . I've added some additional test coverage at your suggestion. I just solved some merge conflicts with master. And I've now gotten a chance to read your approach. The way I see it, your changeset has followed an extremely similar arc to mine. Started with trying to derive iceberg order from spark order. Then, moved to using the iceberg table ordering at write time (in combo w/ the various other misc options out there like explicitly setting a sort order for compactions/rewrites). They diverge by about ~60 LoC with the primary difference between them being: my changeset is a bit more explicit (e.g. uses stronger types & less implicit coupling between iceberg sort order & spark sort). And at this point, that's the total sum of the differences. You're a maintainer so you absolutely get the final call here, but as far as I see it they both have their advantages. I think my approach is probably slightly friendlier to non-long-term readers of the codebase, but it does also come w/ the downside of being slightly less performant in terms of serializing a slightly larger object to the executors. In practice, this has been running in production for us on larger tables (hundreds of TB) that get partially rewritten many many times per day and I've never noticed the increased overhead on a flamegraph, so I don't think that's a crazy concern. But yeah otherwise, I'm happy to change to yours if you think the 60 LoC difference / minute difference in approach makes a huge difference in readability. Having read both: we both converged on the same place for this changeset. They are functionally identical and just two different styles of expressing the same concept. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
