gruuya opened a new pull request, #7180:
URL: https://github.com/apache/arrow-datafusion/pull/7180

   # Which issue does this PR close?
   
   Partially addresses #7149 
   
   # Rationale for this change
   
   Try to take advantage of the `fetch` count known to the `SortExec` to reduce 
the size of the sorted batches that are later merged.
   
   # What changes are included in this PR?
   
   Instead of accumulating full batches before sorting or spilling them in 
preparation for the merge sort, sort each incoming batch as it arrives inside 
the `ExternalSorter`.
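
The idea can be illustrated with a minimal, standalone sketch. It is not the code in this PR and does not use DataFusion or Arrow types: plain `Vec<i64>` values stand in for record batches, and `sort_batch` and `merge_sorted` are hypothetical helper names chosen for illustration. Each incoming batch is sorted on arrival and, when a `fetch` limit is known, cut down to its first `fetch` rows, so the later merge only ever sees small sorted runs.

```rust
use std::collections::BinaryHeap;

/// Sort one incoming batch in descending order and, when a `fetch` limit is
/// known, keep only its first `fetch` rows. (Hypothetical helper.)
fn sort_batch(mut batch: Vec<i64>, fetch: Option<usize>) -> Vec<i64> {
    batch.sort_unstable_by(|a, b| b.cmp(a));
    if let Some(k) = fetch {
        // Rows past the first `k` of a sorted batch can never reach the
        // global top `k`, so they can be dropped before the merge.
        batch.truncate(k);
    }
    batch
}

/// K-way merge of descending-sorted runs, stopping once `fetch` rows have
/// been produced. (Hypothetical helper.)
fn merge_sorted(runs: &[Vec<i64>], fetch: Option<usize>) -> Vec<i64> {
    let limit = fetch.unwrap_or(usize::MAX);
    // Max-heap of (value, run index, offset within that run).
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&value) = run.first() {
            heap.push((value, run_idx, 0usize));
        }
    }
    let mut out = Vec::new();
    while out.len() < limit {
        match heap.pop() {
            Some((value, run_idx, offset)) => {
                out.push(value);
                // Refill the heap from the run that just yielded a row.
                if let Some(&next) = runs[run_idx].get(offset + 1) {
                    heap.push((next, run_idx, offset + 1));
                }
            }
            None => break,
        }
    }
    out
}
```

With `fetch = Some(k)`, every run handed to the merge holds at most `k` rows, which is where the reduction in merged data would come from in this simplified model.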
   
   # Are these changes tested?
   
   The existing sort-spill tests pass. To gauge the timing and memory implications, I used the following setup:
   - `jemallocator::Jemalloc` as the global allocator, in order to record memory profiles with bytehound
   - `https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet` as the target of the external table `CREATE EXTERNAL TABLE supply_chains STORED AS PARQUET LOCATION '/home/ubuntu/supply-chains.parquet';`
   - run `SELECT * FROM supply_chains ORDER BY flow_id DESC LIMIT K` for K = 1, 10, 100, 1000
   
   I've recorded the following:
   
   1. current main
   <img width="1395" alt="image" src="https://github.com/apache/arrow-datafusion/assets/45558892/b0fa3351-f8a2-48b8-af85-1aed4e47186d">
   2. this PR
   <img width="1380" alt="image" src="https://github.com/apache/arrow-datafusion/assets/45558892/a648e88b-c444-49a2-9a5e-6d50f516fae1">
   
   
   # Are there any user-facing changes?
   
   None, apart from changed runtime and memory profiles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
