marton-bod commented on pull request #3377:
URL: https://github.com/apache/iceberg/pull/3377#issuecomment-954691017


   Thanks for reviewing, @aokolnychyi! You're right that the spec mandates 
that delete entries be sorted by `file_path` and `file_pos`. That's what 
we do on the Hive side as well, but we ran into a problem: since data files 
can be added via the API with arbitrary names, an alphabetical sort on 
`file_path` can still produce out-of-order partition values. 
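
   To illustrate the ordering problem, here is a minimal sketch using plain tuples rather than any Iceberg API. The file names, positions, and `day=...` partition labels are made up for the example; the only point is that sorting by path and position alone does not keep partitions contiguous.

```python
from itertools import groupby

# Hypothetical position-delete rows: (file_path, file_pos, partition).
# File names are arbitrary, so they carry no information about the partition.
deletes = [
    ("c.parquet", 4, "day=2"),
    ("a.parquet", 7, "day=2"),
    ("b.parquet", 1, "day=1"),
    ("a.parquet", 2, "day=2"),
]

# Sort the way the spec requires: by file_path, then file_pos.
by_path = sorted(deletes, key=lambda r: (r[0], r[1]))

# Collapse consecutive rows with the same partition into one run each.
partition_sequence = [partition for partition, _ in groupby(r[2] for r in by_path)]

print(partition_sequence)  # ['day=2', 'day=1', 'day=2']
```

   `day=2` shows up in two separate runs, so a writer that expects all rows of one partition to arrive together would see that partition twice.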
   
   As for the `spec_id` and `partition` columns, to be honest I missed 
that they had been added to `MetadataColumns` :) I haven't tried them out 
yet, but I assume you could include them in the sort as well (e.g. `sort by 
spec_id, partition, file_path, file_pos`) and get your data perfectly 
clustered with their help.
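
   A rough sketch of what that composite sort would buy, again with plain tuples and made-up values rather than the real metadata columns (I haven't verified this against an actual table):

```python
from itertools import groupby

# Hypothetical rows: (spec_id, partition, file_path, file_pos).
rows = [
    (0, "day=2", "c.parquet", 4),
    (0, "day=2", "a.parquet", 7),
    (0, "day=1", "b.parquet", 1),
    (0, "day=2", "a.parquet", 2),
]

# Tuples compare element by element, so this sorts by
# spec_id, partition, file_path, file_pos in that order.
clustered = sorted(rows)

# Each (spec_id, partition) pair should now form exactly one contiguous run.
clustered_runs = [key for key, _ in groupby((r[0], r[1]) for r in clustered)]

print(clustered_runs)  # [(0, 'day=1'), (0, 'day=2')]
```

   Within each partition the rows are still ordered by `file_path` and `file_pos`, so the spec's ordering requirement holds for every per-partition delete file.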
   
   I think there might still be some utility in keeping this writer 
implementation as well, for problems similar to the one described above, but 
I'll leave it up to you. What do you think?

