rdblue commented on pull request #1373: URL: https://github.com/apache/iceberg/pull/1373#issuecomment-683210910
I've not gone though the tests thoroughly yet, but the implementation looks great to me. Unfortunately, I think there's a major problem with adding sort orders: what if an older writer rewrites table metadata and drops the sort order list? For example, if I have a table that is sorted by `id ASC NULLS FIRST`, that would get sort order ID 0. If an older client writes, then that gets erased and the next reader assumes that the table is unordered. Erasing the sort order is an annoyance right now. But if we use the sort order for delete files and data files, then this could turn into a correctness problem. Say a table has a data file sorted by `id ASC NULLS FIRST` and labelled with `sort_order_id = 0` in manifests. When the sort order gets erased or replaced, there is no longer any tracking for what ID 0 meant. If I then add an unsorted equality delete file it may also get `sort_order_id = 0` in its metadata. Then it would appear that the deletes can be merged instead of using a delete set because the order matches between the data file and the delete file when it actually does not. I can think of two solutions. First, we could add sort orders to v1, but require them for v2 like we do for the partition spec list. That way, all v2 writers will guarantee that sort orders are not dropped and we no longer have a problem. The second option is to assign sort order IDs using a hash so two orders would not collide. I think I prefer the first option. Thoughts? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
