gene-bordegaray commented on issue #21992:
URL: https://github.com/apache/datafusion/issues/21992#issuecomment-4466649802

   Continuing discussion from the PR here with @stuhood and @Dandandan to keep 
things traceable for everyone involved. Both have raised question on why we 
decided to introduce `ExprPartitioning` rather than regular 
`RangePartitioning`. I have pointed to `ExprPartitioning` being more flexible 
but rememeber we decided to pursue `Range` before `Expr` due to implementation 
complexity and use cases (sorry have had lots going on this week 😅 ).
   
   With this I have revisited the white boarding we did and thought about the 
points brought up on the PR regarding repartitioning. 
   
   One option I am thinking about is a range representation that stores the 
partitioning expressions, the split points for each expression, and a map from 
range-grid cells to output partitions:
   
   ```text
   p0: date in [2021-01-01, 2022-01-01) AND city in [Allston, Boston)
   p1: date in [2021-01-01, 2022-01-01) AND city in [Boston, NYC)
   p2: date in [2022-01-01, 2023-01-01) AND city in [Allston, Boston)
   p3: date in [2022-01-01, 2023-01-01) AND city in [Boston, NYC)
   
   Represented as:
   
   exprs: [date, city]
   date split points: [2021-01-01, 2022-01-01, 2023-01-01]
   city split points: [Allston, Boston, NYC]
   
   cell -> partition:
     ([2021-01-01, 2022-01-01), [Allston, Boston)) -> p0
     ([2021-01-01, 2022-01-01), [Boston, NYC))     -> p1
     ([2022-01-01, 2023-01-01), [Allston, Boston)) -> p2
     ([2022-01-01, 2023-01-01), [Boston, NYC))     -> p3
   ```
   This is close to @alamb  `[exprs] -> [[ranges]]` suggestion, but change a 
bit for routing as @stuhood brought to my attention. Andrew’s form represents 
each partition as a set of ranges while this form stores split points plus a 
`cell -> partition` map. They are both able to represent the same things, but 
the this makes repartitioning a bit more efficient I believe (please inform me 
if I am missing something) since we evaluate the expressions, binary-search 
each dimension’s split points, then look up the target partition instead of 
scanning every partition’s range.
   
   We could also keep Andrew’s `[exprs] -> [[ranges]]` representation as the 
public representation, but build this form described above when routing is 
needed. This might be nice as I think Andrew's representation is a bit more 
readable but hides some implementation details from the user.
   
   Sorry for the confusion guys, thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to