talatuyarer opened a new pull request, #16330: URL: https://github.com/apache/iceberg/pull/16330
RewriteDataFiles with the SORT strategy currently re-sorts every file the planner selects, regardless of whether that file is already sorted by the table's current sort order. The `BinPackRewriteFilePlanner` selects files purely based on size and delete-file metadata (min/max-file-size-bytes, delete-file-threshold, delete-ratio-threshold) and has no awareness of `sort_order_id`, even though every data file already records the sort order it was written with.

I believe repeated sort maintenance is wasteful. Operators who run sort compaction on a schedule re-shuffle and rewrite data that is already correctly sorted on every run.

Additionally, there is no targeted path for sort-order evolution. When a table's sort order changes, existing files keep their old `sort_order_id`. Today, the only way to realign them is with `rewrite-all`, which also rewrites files that are already sorted correctly.

I added a boolean planner option, `rewrite-stale-sort-order` (default: false), to `BinPackRewriteFilePlanner`. When enabled, the planner additionally selects data files whose `sort_order_id` does not match the table's current default sort order ID. This allows a sort-based rewrite to reorganize only the files not already sorted by the current order.

I believe this is a natural extension of the planner's existing metadata-driven selection, as `sort_order_id` is another piece of already-recorded file metadata the planner can use to decide what is worth rewriting. It is intended for use with the SORT strategy, so rewritten files are stamped with the table's current sort order ID.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
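The selection rule in the PR description above can be illustrated with a small standalone sketch. This is not the code in the PR and does not use Iceberg classes: `StaleSortOrderSketch`, `FileInfo`, and `select` are hypothetical names made up for illustration, and only the size-based check is modeled (the delete-file thresholds are left out). It assumes the new option simply adds one more reason to select a file, on top of the existing ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not the Iceberg implementation. */
final class StaleSortOrderSketch {

  /** Hypothetical stand-in for the per-file metadata a planner would read. */
  static final class FileInfo {
    final String path;
    final long sizeBytes;
    final int sortOrderId;

    FileInfo(String path, long sizeBytes, int sortOrderId) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.sortOrderId = sortOrderId;
    }
  }

  /**
   * Returns the paths of files that would be selected for rewrite.
   *
   * A file is selected if it falls outside [minBytes, maxBytes] (the existing
   * size-based rule), or, when rewriteStaleSortOrder is true, if its recorded
   * sort order ID differs from the table's current default sort order ID.
   */
  static List<String> select(
      List<FileInfo> files,
      long minBytes,
      long maxBytes,
      int currentSortOrderId,
      boolean rewriteStaleSortOrder) {
    List<String> selected = new ArrayList<>();
    for (FileInfo file : files) {
      boolean outsideSizeRange = file.sizeBytes < minBytes || file.sizeBytes > maxBytes;
      boolean staleSortOrder =
          rewriteStaleSortOrder && file.sortOrderId != currentSortOrderId;
      if (outsideSizeRange || staleSortOrder) {
        selected.add(file.path);
      }
    }
    return selected;
  }
}
```

With the option left at its default of false, a well-sized file written under an older sort order is skipped; with the option enabled, the same file is picked up, while well-sized files that already carry the current sort order ID are still left alone.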
