alamb commented on issue #6983: URL: https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1664194344
I think there is something in the physical planning that assumes the result of the final plan should be in a single partition (or at least it won't expand it when adding additional partioning). because when connecting to a client this is what makes the most sense I believe this is controlled by `ExecutionPlan::benefits_from_input_partitioning`: https://github.com/apache/arrow-datafusion/blob/6a2d4a3a254c0495a398608d178496a191450750/datafusion/core/src/physical_plan/mod.rs#L169-L183 `ProjectionExec` returns false if it is only columns (which is what these queries are doing) https://github.com/apache/arrow-datafusion/blob/6a2d4a3a254c0495a398608d178496a191450750/datafusion/core/src/physical_plan/projection.rs#L285 So the reason the filter case goes faster is that the filter is that the filter will return `true` for benefits from repartitioning but the Partition won't. I wonder if we could somehow add a flag to `LogicalProjection` / `ProjectionExec` that says "always benefits from repartitioning somehow and set that flag for the writes 🤔 Alternately, I was thinking the `ExecutionPlan that does the writing could say "I want the input partitioned" and the optimizer would do the right thing. But given the DataFrame API doesn't use an ExecutionPlan for writing it might not work. Thank you both for pushing on this -- it is going to be awesome to get this working correctly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
