tustvold edited a comment on pull request #378:
URL: https://github.com/apache/arrow-datafusion/pull/378#issuecomment-847158217


   I did some more digging into this and created #412 to track the fact that 
PhysicalPlanner currently creates plans that are incorrect.
   
   However, I think the issue is actually a bit more subtle than I first 
realised. Currently `Repartition` may insert `RepartitionExec` between an 
operator and its children, provided that operator doesn't require a single 
partition. It is then reliant on a later optimisation pass with `AddMergeExec` 
to join together the partitions if a operator further up the tree requires it.
   
   This means that the operators inserted by PhysicalPlan must somehow remember 
the partitioning they need to be correct, in order to prevent the optimiser 
from breaking them, simply adding MergeExec when generating the initial plan is 
insufficient.
   
   There are a couple of ways this gets handled that I can see:
   
   * Limit has two separate operators - `GlobalLimitExec` and `LocalLimitExec`
   * `HashAggregateExec` has an `AggregateMode` enumeration
   
   I therefore think the addition of a `preserve_partitioning` flag to 
`SortExec` is necessary and has precedent.
   
   However, it is unfortunately insufficient as nothing prevents `Repartition` 
from repartitioning a sorted partition (I think this might be an issue more 
generally). I need to think on this more, perhaps as @alamb mentioned on #379 
there needs to be a concept of sorted-ness introduced for operators that 
optimisation passes such as `Repartition` and `AddMergeExec` would respect.
   
   Going to mark this as a draft for now, as the above will have implications 
for what the best way forward for this is


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to