marton-bod commented on pull request #3377: URL: https://github.com/apache/iceberg/pull/3377#issuecomment-954691017
Thanks for reviewing, @aokolnychyi! You're right that the spec mandates that delete entries be sorted by `file_path` and `file_pos`. That's what we do on the Hive side as well, but we ran into a problem: since data files can be added via the API with arbitrary names, an alphabetical sort on `file_path` can still produce out-of-order partition values. As for the `spec_id` and `partition` columns, to be honest I kinda missed that they had been added to `MetadataColumns` :) I haven't tried them out yet, but I assume you could include them in the sort as well (e.g. `sort by spec_id, partition, file_path, file_pos`) and have your data perfectly clustered with their help. I think there might still be some utility in keeping this writer implementation for problems like the one described above, but I'll leave it up to you. What do you think?
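
To make the ordering problem concrete, here is a rough sketch in plain Python (not the Iceberg or Hive API). The `PositionDelete` type, the partition strings, and the file paths are all made up for illustration; only the column names come from the comment above. It shows that sorting by `file_path, file_pos` alone can interleave partitions when file names are arbitrary, while adding `spec_id, partition` as a sort prefix should keep each partition contiguous.

```python
from itertools import groupby
from typing import NamedTuple


class PositionDelete(NamedTuple):
    """Hypothetical stand-in for one position-delete row."""
    spec_id: int
    partition: str  # simplified: a real partition value is a struct
    file_path: str
    file_pos: int


def partition_runs(deletes):
    """List the partitions in encounter order, collapsing consecutive repeats."""
    return [key for key, _ in groupby(d.partition for d in deletes)]


# Made-up rows: file names are arbitrary and unrelated to their partition.
deletes = [
    PositionDelete(0, "p=a", "/data/zzz.parquet", 7),
    PositionDelete(0, "p=b", "/data/mmm.parquet", 3),
    PositionDelete(0, "p=a", "/data/aaa.parquet", 5),
    PositionDelete(0, "p=b", "/data/bbb.parquet", 1),
]

# Sort required by the spec: file path, then position.
by_path = sorted(deletes, key=lambda d: (d.file_path, d.file_pos))

# Same sort, prefixed with spec_id and partition as suggested above.
clustered = sorted(
    deletes,
    key=lambda d: (d.spec_id, d.partition, d.file_path, d.file_pos),
)

runs_by_path = partition_runs(by_path)      # partition "p=a" shows up twice
runs_clustered = partition_runs(clustered)  # each partition shows up once

print(runs_by_path)
print(runs_clustered)
```

Within each partition the rows remain ordered by `file_path` and `file_pos`, so the extended sort key does not conflict with the per-file ordering the spec asks for.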
