[GitHub] [arrow-datafusion] alamb commented on pull request #7250: Request for Comment: Native `TopK` Operator

via GitHub Wed, 23 Aug 2023 06:05:30 -0700


alamb commented on PR #7250:
URL: 
https://github.com/apache/arrow-datafusion/pull/7250#issuecomment-1689929860


   Update:
   1. I updated the core algorithm to use a `BinaryHeap` from the Rust standard library. That works very well and is faster than `main` for `LIMIT 10000` (aka "large k") queries, including the adversarial worst case where the data is reverse sorted.
   2. I have started working on "compaction" to improve memory usage for "large k" queries.
   
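For readers unfamiliar with the heap-based approach in item 1, here is a minimal sketch of a bounded top-k using `std::collections::BinaryHeap`. It is not the code in this PR: it works on plain `i64` values rather than on the operator's actual input, and the function name `top_k_desc` is hypothetical. It only illustrates the general technique of keeping a heap of at most `k` items for an `ORDER BY ... DESC LIMIT k` style query.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper (not from the PR): returns the `k` largest values,
/// largest first, while holding at most `k` values in memory.
fn top_k_desc(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    // `BinaryHeap` is a max-heap; wrapping values in `Reverse` makes the
    // top of the heap the smallest of the values kept so far.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    for v in values {
        if heap.len() < k {
            heap.push(Reverse(v));
        } else if let Some(mut smallest) = heap.peek_mut() {
            // Replace the current minimum only if the new value beats it.
            // The heap is re-ordered when `smallest` is dropped.
            if v > smallest.0 {
                *smallest = Reverse(v);
            }
        }
    }
    // Ascending order of `Reverse<i64>` is descending order of the values.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(v)| v)
        .collect()
}
```

For example, `top_k_desc(vec![5, 1, 9, 3, 7], 2)` returns `[9, 7]`. In this sketch an input sorted ascending is the costly case, because every new value replaces the heap minimum; each such replacement is O(log k), so the total work stays O(n log k).
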
   Current status
   
   | Query | Time / CPU compared to `main` | Memory compared to `main` |
   |--------|--------|--------|
   | `select * from 'traces_nd_adversarial.parquet' order by time desc limit 10` | better | better |
   | `select * from 'traces_nd_adversarial.parquet' order by time desc limit 10000` | better | **SAME** |
   | `select * from 'traces.parquet' order by time desc limit 10` | better | better |
   | `select * from 'traces.parquet' order by time desc limit 10000` | better | **SAME** |
   
   
   Remaining todos:
   
   - [ ] Improve memory usage compared to `main` for "large k" queries (via "compaction")
   - [ ] Debug some issues with high cardinality dictionaries
   


