gruuya opened a new pull request, #7180: URL: https://github.com/apache/arrow-datafusion/pull/7180
# Which issue does this PR close?

Partially addresses #7149

# Rationale for this change

Try to take advantage of the `fetch` count known to the `SortExec` to reduce the size of the sorted batches that are later merged.

# What changes are included in this PR?

Instead of accumulating the full batches prior to sorting/spilling them in preparation for the merge-sort, try to do the sorting ahead of time on each incoming batch inside of the `ExternalSorter`.

# Are these changes tested?

The existing sort-spill tests pass.

As for the timing and memory implications, using the setup:
- `jemallocator::Jemalloc` for the global allocator in order to record the memory profiles using bytehound
- `https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet` as target for the external table `CREATE EXTERNAL TABLE supply_chains STORED AS PARQUET LOCATION '/home/ubuntu/supply-chains.parquet';`
- run `SELECT * FROM supply_chains ORDER BY flow_id DESC LIMIT K` for K=1, 10, 100, 1000

I've recorded the following:
1. current main
   <img width="1395" alt="slika" src="https://github.com/apache/arrow-datafusion/assets/45558892/b0fa3351-f8a2-48b8-af85-1aed4e47186d">
2. this PR
   <img width="1380" alt="slika" src="https://github.com/apache/arrow-datafusion/assets/45558892/a648e88b-c444-49a2-9a5e-6d50f516fae1">

# Are there any user-facing changes?

Only runtime/memory profiles.
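
For readers who want a concrete picture of the idea described under "What changes are included in this PR?", here is a minimal, self-contained sketch. It is not the DataFusion implementation: it uses plain `Vec<i64>` values in place of Arrow record batches, and the names `sort_and_truncate`, `merge_runs`, and `top_k_desc` are hypothetical helpers invented for illustration. It only shows why sorting each incoming batch early and cutting it to `fetch` rows shrinks the input to the final merge.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper: sort one incoming batch in descending order and,
/// if a `fetch` limit is known, keep only the first `fetch` rows.
fn sort_and_truncate(mut batch: Vec<i64>, fetch: Option<usize>) -> Vec<i64> {
    batch.sort_unstable_by(|a, b| b.cmp(a));
    if let Some(k) = fetch {
        // Rows beyond position `k` in this batch can never be in the global top `k`.
        batch.truncate(k);
    }
    batch
}

/// Hypothetical helper: k-way merge of runs that are already sorted in
/// descending order, stopping once `fetch` rows have been produced.
fn merge_runs(runs: Vec<Vec<i64>>, fetch: Option<usize>) -> Vec<i64> {
    // Max-heap of (value, run index, offset within that run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push((v, i, 0usize));
        }
    }

    let limit = fetch.unwrap_or(usize::MAX);
    let mut out = Vec::new();
    while out.len() < limit {
        let Some((v, i, off)) = heap.pop() else { break };
        out.push(v);
        // Advance the run that the emitted value came from.
        if let Some(&next) = runs[i].get(off + 1) {
            heap.push((next, i, off + 1));
        }
    }
    out
}

/// Sort every batch as it arrives, then merge the (already truncated) runs.
fn top_k_desc(batches: Vec<Vec<i64>>, fetch: Option<usize>) -> Vec<i64> {
    let runs = batches
        .into_iter()
        .map(|b| sort_and_truncate(b, fetch))
        .collect();
    merge_runs(runs, fetch)
}
```

With this sketch, `top_k_desc(vec![vec![3, 9, 1], vec![7, 2, 8], vec![5]], Some(2))` returns `[9, 8]`, and each run handed to the merge holds at most two values instead of the full batch.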
