[GitHub] [arrow-datafusion] alamb opened a new pull request, #6310: Improve parallelism of repartition operator

via GitHub Tue, 09 May 2023 08:18:28 -0700


alamb opened a new pull request, #6310:
URL: https://github.com/apache/arrow-datafusion/pull/6310


   # Which issue does this PR close?
   
   Close https://github.com/apache/arrow-datafusion/issues/6290
   
   # Rationale for this change
   
   I was testing query performance for 
https://github.com/apache/arrow-datafusion/issues/6278 and noticed that only a 
single core was being used on a query running entirely in memory. When I spent 
some time looking into it, the plan looked correct, with repartitioning, but 
for some reason it wasn't properly repartitioning the data.
   
   # What changes are included in this PR?
   
   Don't yield on *every* batch -- yield only after we have made some decent 
progress (in this case at least `partition_count` batches)
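
The yielding policy described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code in this PR: `YieldBudget`, `batch_done`, and `count_yields` are hypothetical names invented for the example, and the point where an async task would actually hand control back to the scheduler (for example with `tokio::task::yield_now().await`) is only marked by a returned `bool`.

```rust
/// Hypothetical helper (not part of DataFusion): tracks how many batches
/// may still be processed before the task should yield to the scheduler.
pub struct YieldBudget {
    partition_count: usize,
    remaining: usize,
}

impl YieldBudget {
    pub fn new(partition_count: usize) -> Self {
        // Assumption: treat zero partitions as one so the budget never underflows.
        let partition_count = partition_count.max(1);
        Self {
            partition_count,
            remaining: partition_count,
        }
    }

    /// Call once per processed batch. Returns `true` when `partition_count`
    /// batches have been processed since the last yield, which is where an
    /// async task would yield.
    pub fn batch_done(&mut self) -> bool {
        self.remaining -= 1;
        if self.remaining == 0 {
            // Reset the budget for the next round of batches.
            self.remaining = self.partition_count;
            true
        } else {
            false
        }
    }
}

/// Counts how many times a loop over `num_batches` batches would yield.
pub fn count_yields(num_batches: usize, partition_count: usize) -> usize {
    let mut budget = YieldBudget::new(partition_count);
    (0..num_batches).filter(|_| budget.batch_done()).count()
}
```

Under this sketch, a budget of 1 corresponds to yielding on every batch, while a budget of 4 yields only a quarter as often: `count_yields(12, 1)` is 12 and `count_yields(12, 4)` is 3.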
   
   # Are these changes tested?
   I manually tested this -- I will add a benchmark for it shortly
   
   My manual test results are:
   
   On main (keeps only 1 core busy most of the time)
   ```
   4 rows in set. Query took 10.631 seconds.
   ```
   
   With this PR (keeps the cores much busier)
   ```
   4 rows in set. Query took 3.705 seconds.
   ```
   
   
   
   # Are there any user-facing changes?
   


