tustvold commented on pull request #378: URL: https://github.com/apache/arrow-datafusion/pull/378#issuecomment-847158217
I did some more digging into this and created #412 to track the fact that PhysicalPlanner currently creates plans that are incorrect. However, I think the issue is actually a bit more subtle than I first realised. Currently `Repartition` may insert `RepartitionExec` between an operator and its children, provided that operator doesn't require a single partition. It is then reliant on a later optimisation pass with AddMergeExec to join together the partitions if a operator further up the tree requires it. This means that the operators inserted by PhysicalPlan must somehow remember the partitioning they need to be correct, in order to prevent the optimiser from breaking them, simply adding MergeExec when generating the initial plan is insufficient. There are a couple of ways this gets handled that I can see: * Limit has two separate operators - `GlobalLimitExec` and `LocalLimitExec` * `HashAggregateExec` has an `AggregateMode` enumeration I therefore think the addition of a `preserve_partitioning` flag to `SortExec` is necessary and has precedent. However, it is unfortunately insufficient as nothing prevents `Repartition` from repartitioning a sorted partition (I think this might be an issue more generally). I need to think on this more, perhaps as @alamb mentioned on #379 there needs to be a concept of sorted-ness introduced for operators that optimisation passes such as `Repartition` and `AddMergeExec` would respect. Going to mark this as a draft for now, as the above will have implications for what the best way forward for this is -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
