rdblue opened a new pull request, #5131:
URL: https://github.com/apache/iceberg/pull/5131

   This updates `SortOrderUtil.buildSortOrder` to produce a correct clustering 
when building a final sort order. This method creates a new sort order that 
clusters rows for a given partition spec and produces the ordering that was 
passed in for rows within the same partition.
   
   Previously, this was done by checking whether the partition source fields 
were included in the sort order, and that the transform applied to the source 
data satisfied the partition field's sort order. This simple approach left a 
few cases where the clustering was not actually satisfied.
   
   For example: when partitioning by `[ day(ts), category ]` and sorting by `[ 
ts, category ]`, the original order would be used because `ts` satisfies the 
order of `days(ts)` and `category` satisfies the order for `category`. However, 
this fails for the following data:
   
   ```text
   2022-06-24T00:00:00.000000, a    -> Partition(2022-06-24, a)
   2022-06-24T00:00:00.000000, b    -> Partition(2022-06-24, b)
   2022-06-24T00:00:00.000001, a    -> Partition(2022-06-24, a)    <-- fails 
when writing from Spark
   2022-06-24T00:00:00.000001, b    -> Partition(2022-06-24, b)
   ```
   
   If the sort order were reversed, `[ category, ts ]` then the sort order does 
not fail.
   
   ```text
   2022-06-24T00:00:00.000000, a    -> Partition(2022-06-24, a)
   2022-06-24T00:00:00.000001, a    -> Partition(2022-06-24, a)
   2022-06-24T00:00:00.000000, b    -> Partition(2022-06-24, b)
   2022-06-24T00:00:00.000001, b    -> Partition(2022-06-24, b)
   ```
   
   The problem is that a sort field that satisfies a partition field's ordering 
but is more granular can only be used as the last field to produce the 
partition clustering. If another partition field comes after, the clustering is 
no longer valid.
   
   This PR fixes the problem by changing how the partition clustering prefix is 
constructed:
   1. Build a map of the required clustering fields
   2. For each sort field, check if it matches the transform/source of a 
required clustering field
   3. For fields that match a clustering field, remove that field because it is 
satisfied by the ordering
   4. Once a sort field that does not match a clustering field is found, stop 
considering sort fields
   5. For the last field that does not match a clustering field, check if it 
satisfies the ordering of a clustering field and, if so, remove the clustering 
field
   6. Add any remaining clustering fields to the start of the final order
   
   For the example above with `[ days(ts), category ]` and sort order `[ ts, 
category ]`:
   1. The required clustering fields are `days(ts)` and `category`.
   2. There are no fields at the start of the sort order that exactly match a 
required clustering field
   3. The sort field `ts` satisfies the order of `days(ts)`, so that is removed
   4. The remaining required clustering field is `category`
   5. `category` is added to the beginning of the sort order
   6. The final sort order is `[ category, ts, category ]`, which satisfies the 
required partition clustering and retains the incoming ordering within 
partitions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to