rdblue commented on pull request #1373:
URL: https://github.com/apache/iceberg/pull/1373#issuecomment-683210910


   I've not gone though the tests thoroughly yet, but the implementation looks 
great to me.
   
   Unfortunately, I think there's a major problem with adding sort orders: what 
if an older writer rewrites table metadata and drops the sort order list? For 
example, if I have a table that is sorted by `id ASC NULLS FIRST`, that would 
get sort order ID 0. If an older client writes, then that gets erased and the 
next reader assumes that the table is unordered.
   
   Erasing the sort order is an annoyance right now. But if we use the sort 
order for delete files and data files, then this could turn into a correctness 
problem. Say a table has a data file sorted by `id ASC NULLS FIRST` and 
labelled with `sort_order_id = 0` in manifests. When the sort order gets erased 
or replaced, there is no longer any tracking for what ID 0 meant. If I then add 
an unsorted equality delete file it may also get `sort_order_id = 0` in its 
metadata. Then it would appear that the deletes can be merged instead of using 
a delete set because the order matches between the data file and the delete 
file when it actually does not.
   
   I can think of two solutions. First, we could add sort orders to v1, but 
require them for v2 like we do for the partition spec list. That way, all v2 
writers will guarantee that sort orders are not dropped and we no longer have a 
problem. The second option is to assign sort order IDs using a hash so two 
orders would not collide.
   
   I think I prefer the first option. Thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to