[GitHub] [arrow-datafusion] alamb commented on issue #6983: [DataFrame] Parallel Load into dataframe

via GitHub Thu, 03 Aug 2023 08:27:16 -0700


alamb commented on issue #6983:
URL: 
https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1664194344


   I think there is something in the physical planning that assumes the result 
of the final plan should be in a single partition (or at least it won't expand 
it when adding additional partioning). because when connecting to a client this 
is what makes the most sense
   
   I believe this is controlled by 
`ExecutionPlan::benefits_from_input_partitioning`: 
https://github.com/apache/arrow-datafusion/blob/6a2d4a3a254c0495a398608d178496a191450750/datafusion/core/src/physical_plan/mod.rs#L169-L183
   
   `ProjectionExec` returns false if it is only columns (which is what these 
queries are doing)
   
   
https://github.com/apache/arrow-datafusion/blob/6a2d4a3a254c0495a398608d178496a191450750/datafusion/core/src/physical_plan/projection.rs#L285
   
   So the reason the filter case goes faster is that the filter is that the 
filter will return `true` for benefits from repartitioning but the Partition 
won't.
   
   I wonder if we could somehow add a flag to `LogicalProjection` / 
`ProjectionExec` that says "always benefits from repartitioning somehow and set 
that flag for the writes 🤔 
   
   Alternately, I was thinking the `ExecutionPlan that does the writing could 
say "I want the input partitioned" and the optimizer would do the right thing.  
But given the DataFrame API doesn't use an ExecutionPlan for writing it might 
not work. 
   
   Thank you both for pushing on this -- it is going to be awesome to get this 
working correctly
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on issue #6983: [DataFrame] Parallel Load into dataframe

Reply via email to