gene-bordegaray commented on issue #21992:
URL: https://github.com/apache/datafusion/issues/21992#issuecomment-4466649802
Continuing discussion from the PR here with @stuhood and @Dandandan to keep
things traceable for everyone involved. Both have raised question on why we
decided to introduce `ExprPartitioning` rather than regular
`RangePartitioning`. I have pointed to `ExprPartitioning` being more flexible
but rememeber we decided to pursue `Range` before `Expr` due to implementation
complexity and use cases (sorry have had lots going on this week 😅 ).
With this I have revisited the white boarding we did and thought about the
points brought up on the PR regarding repartitioning.
One option I am thinking about is a range representation that stores the
partitioning expressions, the split points for each expression, and a map from
range-grid cells to output partitions:
```text
p0: date in [2021-01-01, 2022-01-01) AND city in [Allston, Boston)
p1: date in [2021-01-01, 2022-01-01) AND city in [Boston, NYC)
p2: date in [2022-01-01, 2023-01-01) AND city in [Allston, Boston)
p3: date in [2022-01-01, 2023-01-01) AND city in [Boston, NYC)
Represented as:
exprs: [date, city]
date split points: [2021-01-01, 2022-01-01, 2023-01-01]
city split points: [Allston, Boston, NYC]
cell -> partition:
([2021-01-01, 2022-01-01), [Allston, Boston)) -> p0
([2021-01-01, 2022-01-01), [Boston, NYC)) -> p1
([2022-01-01, 2023-01-01), [Allston, Boston)) -> p2
([2022-01-01, 2023-01-01), [Boston, NYC)) -> p3
```
This is close to @alamb `[exprs] -> [[ranges]]` suggestion, but change a
bit for routing as @stuhood brought to my attention. Andrew’s form represents
each partition as a set of ranges while this form stores split points plus a
`cell -> partition` map. They are both able to represent the same things, but
the this makes repartitioning a bit more efficient I believe (please inform me
if I am missing something) since we evaluate the expressions, binary-search
each dimension’s split points, then look up the target partition instead of
scanning every partition’s range.
We could also keep Andrew’s `[exprs] -> [[ranges]]` representation as the
public representation, but build this form described above when routing is
needed. This might be nice as I think Andrew's representation is a bit more
readable but hides some implementation details from the user.
Sorry for the confusion guys, thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]