alamb commented on PR #7192: URL: https://github.com/apache/arrow-datafusion/pull/7192#issuecomment-1675226071
I was thinking about this problem on a walk. I apologize if this sounds super similar to the topK inhttps://github.com/apache/arrow-datafusion/pull/7250, but you know when you have a hammer 🔨 .... What if you formulated this problem by keeping a list of values, sorted by time, like this: ```rust /// store the topk values by time. Each entry in topk_values /// has a unique trace_id topk_values: Vec<(time, trace_id)> ``` So then each new row can very quickly be checked for "is this one of the top k time values" by checking its time value to the last time value in `topk_values` and skipping it if so (no hash manipulation required) If it turns out that the new value *does* belong in the topk_values you would have to figure out if it was an update to an existing entry (a previously seen `trace_id`) or it should replace an existing entry, based on its `trace_id` You could potentially do so with a structure that maps trace_ids to the entry in `topk_values` -- perhaps like this (or some fancier structure): ```rust trace_ids: HashMap<trace_id, size>; ``` This approach would likely be quite fast as many values can be discarded with a cheap comparison (no hashing, etc) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
