[GitHub] [arrow-ballista] andygrove opened a new issue, #608: Implement optimizer rule to remove redundant repartitioning

GitBox Mon, 16 Jan 2023 10:08:06 -0800


andygrove opened a new issue, #608:
URL: https://github.com/apache/arrow-ballista/issues/608


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   When running benchmark q2 I see this in the plan:
   
   ```
   RepartitionExec: partitioning=Hash([Column { name: "r_regionkey", index: 0 
}], 48), metrics=[fetch_time=69.072358ms, send_time=48ns, repart_time=48ns]
     RepartitionExec: partitioning=RoundRobinBatch(48), 
metrics=[fetch_time=273.515µs, send_time=1ns, repart_time=1ns]
   ```
   
   We introduce the `RoundRobinBatch` partitioning to increase parallelism but 
then we insert the hash partitioning later for the join. It is redundant to 
repartition twice here, and it would be better to just preparation on the join 
key right away.
   
   Note that this repartitioning is especially expensive for distributed query 
engines that build on top of DataFusion (Dask SQL, Ballista) because it 
introduces extra shuffles.
   
   **Describe the solution you'd like**
   Implement new rule to push hash partitioning down and replace the round 
robin partitioning.
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-ballista] andygrove opened a new issue, #608: Implement optimizer rule to remove redundant repartitioning

Reply via email to