[GitHub] [arrow-datafusion] alamb commented on pull request #7192: Create a Priority Queue based Aggregation with `limit`

via GitHub Fri, 11 Aug 2023 11:54:10 -0700


alamb commented on PR #7192:
URL: 
https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1675226071


   I was thinking about this problem on a walk.  I apologize if this sounds 
super similar to the topK 
inhttps://github.com/apache/arrow-datafusion/pull/7250, but you know when you 
have a hammer 🔨 ....
   
   
   What if you formulated this problem by keeping a list of values, sorted by 
time, like this:
   
   ```rust
   /// store the topk values by time. Each entry in topk_values
   /// has a unique trace_id
   topk_values: Vec<(time, trace_id)>
   ```
   
   
   So then each new row can very quickly be checked for "is this one of the top 
k time values" by checking its time value to the last time value in 
`topk_values` and skipping it if so (no hash manipulation required)
   
   If it turns out that the new value *does* belong in the topk_values you 
would have to figure out if it was an update to an existing entry (a previously 
seen `trace_id`) or it should replace an existing entry, based on its 
`trace_id` 
   
   You could potentially do so with a structure that maps trace_ids to the 
entry in `topk_values` -- perhaps like this (or some fancier structure):
   
   ```rust
   trace_ids: HashMap<trace_id, size>;
   ```
   
   This approach would likely be quite fast as many values can be discarded 
with a cheap comparison (no hashing, etc)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on pull request #7192: Create a Priority Queue based Aggregation with `limit`

Reply via email to