rdblue opened a new pull request, #5131: URL: https://github.com/apache/iceberg/pull/5131
This updates `SortOrderUtil.buildSortOrder` to produce a correct clustering when building a final sort order. This method creates a new sort order that clusters rows for a given partition spec and produces the ordering that was passed in for rows within the same partition. Previously, this was done by checking whether the partition source fields were included in the sort order, and that the transform applied to the source data satisfied the partition field's sort order. This simple approach left a few cases where the clustering was not actually satisfied. For example: when partitioning by `[ day(ts), category ]` and sorting by `[ ts, category ]`, the original order would be used because `ts` satisfies the order of `days(ts)` and `category` satisfies the order for `category`. However, this fails for the following data: ```text 2022-06-24T00:00:00.000000, a -> Partition(2022-06-24, a) 2022-06-24T00:00:00.000000, b -> Partition(2022-06-24, b) 2022-06-24T00:00:00.000001, a -> Partition(2022-06-24, a) <-- fails when writing from Spark 2022-06-24T00:00:00.000001, b -> Partition(2022-06-24, b) ``` If the sort order were reversed, `[ category, ts ]` then the sort order does not fail. ```text 2022-06-24T00:00:00.000000, a -> Partition(2022-06-24, a) 2022-06-24T00:00:00.000001, a -> Partition(2022-06-24, a) 2022-06-24T00:00:00.000000, b -> Partition(2022-06-24, b) 2022-06-24T00:00:00.000001, b -> Partition(2022-06-24, b) ``` The problem is that a sort field that satisfies a partition field's ordering but is more granular can only be used as the last field to produce the partition clustering. If another partition field comes after, the clustering is no longer valid. This PR fixes the problem by changing how the partition clustering prefix is constructed: 1. Build a map of the required clustering fields 2. For each sort field, check if it matches the transform/source of a required clustering field 3. For fields that match a clustering field, remove that field because it is satisfied by the ordering 4. Once a sort field that does not match a clustering field is found, stop considering sort fields 5. For the last field that does not match a clustering field, check if it satisfies the ordering of a clustering field and, if so, remove the clustering field 6. Add any remaining clustering fields to the start of the final order For the example above with `[ days(ts), category ]` and sort order `[ ts, category ]`: 1. The required clustering fields are `days(ts)` and `category`. 2. There are no fields at the start of the sort order that exactly match a required clustering field 3. The sort field `ts` satisfies the order of `days(ts)`, so that is removed 4. The remaining required clustering field is `category` 5. `category` is added to the beginning of the sort order 6. The final sort order is `[ category, ts, category ]`, which satisfies the required partition clustering and retains the incoming ordering within partitions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
