NGA-TRAN commented on issue #18777:
URL: https://github.com/apache/datafusion/issues/18777#issuecomment-3547364949

   @gene-bordegaray
   
   > For each input partition, check the min/max ranges to see if they are 
overlapping.
   
   This approach won’t work in most cases. Let’s think it through:
   
   1. Min/max ranges are runtime values.
         - They aren’t available at planning time.
         - You might know them for a leaf node (e.g., DataSourceExec) if exact 
statistics exist.
         - For non-leaf nodes, you won’t have them—and that’s usually where we 
need this information.
   2. File distribution complicates things.
        - Typically, multiple files need to be read.
        - To balance work across partitions/streams and avoid skew, files are 
distributed across them.
        - This means non-contiguous files may end up in the same 
partition/stream.
        - As a result, partitions/streams can have overlapping ranges, even 
though each contains distinct keys, and we can still use FinalPartitioned 
aggregation.
   
   Just like sort order, this should be treated as a data-layout/table 
property—something we either know or discover at planning time and then 
propagate upward from the leaf node.
   
   It’s not a trivial feature, but it’s definitely doable. The path forward is 
to follow the same approach we use for sort properties.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to