talatuyarer opened a new pull request, #16330:
URL: https://github.com/apache/iceberg/pull/16330

   RewriteDataFiles with the SORT strategy currently re-sorts every file the 
planner selects, regardless of whether that file is already sorted by the 
table's current sort order. The `BinPackRewriteFilePlanner` selects files 
purely based on size and delete-file metadata (`min-file-size-bytes`, 
`max-file-size-bytes`, `delete-file-threshold`, `delete-ratio-threshold`) and 
has no awareness of `sort_order_id`, even though every data file already 
records the sort order it was written with.
   
   I believe repeated sort maintenance is wasteful: operators who run sort 
compaction on a schedule re-shuffle and rewrite already-sorted data on every 
run. There is also no targeted path for sort-order evolution. When a table's 
sort order changes, existing files keep their old `sort_order_id`, and today 
the only way to realign them is `rewrite-all`, which also rewrites files that 
are already sorted correctly.
   
   I added a boolean planner option, `rewrite-stale-sort-order` (default: 
false), to `BinPackRewriteFilePlanner`. When enabled, the planner additionally 
selects data files whose `sort_order_id` does not match the table's current 
default sort order ID. This allows a sort-based rewrite to reorganize only the 
files not already sorted by the current order.
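
   As a rough illustration of the selection rule described above, here is a 
minimal, self-contained sketch. It does not use the Iceberg API: the class 
`StaleSortOrderSketch` and the types `FileInfo` and `PlannerConfig` are 
hypothetical stand-ins, the delete-file thresholds are left out for brevity, 
and the sort order ID is modeled as a plain `int`. The actual change in 
`BinPackRewriteFilePlanner` may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the selection rule; not the actual Iceberg implementation.
final class StaleSortOrderSketch {

  // Hypothetical stand-in for the per-file metadata the planner inspects.
  static final class FileInfo {
    final String path;
    final long sizeBytes;
    final int sortOrderId;

    FileInfo(String path, long sizeBytes, int sortOrderId) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.sortOrderId = sortOrderId;
    }
  }

  // Hypothetical stand-in for the planner options relevant to this sketch.
  static final class PlannerConfig {
    final long minFileSizeBytes;
    final long maxFileSizeBytes;
    final boolean rewriteStaleSortOrder; // models `rewrite-stale-sort-order`

    PlannerConfig(long minFileSizeBytes, long maxFileSizeBytes, boolean rewriteStaleSortOrder) {
      this.minFileSizeBytes = minFileSizeBytes;
      this.maxFileSizeBytes = maxFileSizeBytes;
      this.rewriteStaleSortOrder = rewriteStaleSortOrder;
    }
  }

  // Existing size-based criterion: the file is too small or too large.
  static boolean outsideSizeRange(FileInfo file, PlannerConfig config) {
    return file.sizeBytes < config.minFileSizeBytes || file.sizeBytes > config.maxFileSizeBytes;
  }

  // New criterion: the file was written with a sort order other than the current one.
  static boolean hasStaleSortOrder(FileInfo file, int currentSortOrderId) {
    return file.sortOrderId != currentSortOrderId;
  }

  // A file is selected if it meets the size criterion, or, when the option is
  // enabled, if its sort order ID differs from the table's current one.
  static List<FileInfo> select(List<FileInfo> files, PlannerConfig config, int currentSortOrderId) {
    List<FileInfo> selected = new ArrayList<>();
    for (FileInfo file : files) {
      boolean stale = config.rewriteStaleSortOrder && hasStaleSortOrder(file, currentSortOrderId);
      if (outsideSizeRange(file, config) || stale) {
        selected.add(file);
      }
    }
    return selected;
  }
}
```

   With the option disabled, only files outside the size range are picked, 
which matches today's behavior in this simplified model. With it enabled, a 
well-sized file written under an older sort order is picked as well, while a 
well-sized file already carrying the current sort order ID is left alone.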
   
   I believe this is a natural extension of the planner's existing 
metadata-driven selection: `sort_order_id` is another piece of 
already-recorded file metadata the planner can use to decide what is worth 
rewriting. The option is intended for use with the SORT strategy, so that 
rewritten files are stamped with the table's current sort order ID.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

