Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

via GitHub Fri, 15 Aug 2025 08:23:38 -0700


alamb commented on issue #17169:
URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3191783825


   Indeed -- if you already know the data is sorted on the `DISTINCT ON` keys, 
you can deduplicate in a single pass through the data with minimal 
memory requirements, following the suggestion from @connec.
   
   This is the exact strategy we use in InfluxDB IOx to deduplicate data (which 
is how we implement insert order resolution).
   
   The code for it is here:
   - https://github.com/influxdata/influxdb3_core/blob/main/iox_query/src/provider/deduplicate.rs
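
   To illustrate the single-pass idea only, here is a minimal sketch in plain 
Rust. It is not the IOx code linked above and does not use any DataFusion API. 
The function name `distinct_on_sorted` and the `(key, value)` row shape are 
hypothetical, and the input is assumed to already be sorted (or at least 
grouped) by key. The only state carried between rows is the most recent key, 
which is why memory use stays small regardless of input size.

```rust
/// Keep the first row of each run of equal keys.
///
/// Assumption: `rows` is already sorted (or grouped) by key, so equal keys
/// are adjacent. The only state held between rows is the previous key.
fn distinct_on_sorted<K, V, I>(rows: I) -> impl Iterator<Item = (K, V)>
where
    K: PartialEq + Clone,
    I: IntoIterator<Item = (K, V)>,
{
    let mut last_key: Option<K> = None;
    rows.into_iter().filter(move |(key, _)| {
        if last_key.as_ref() == Some(key) {
            // Same key as the previous row: a duplicate, so drop it.
            false
        } else {
            // First row of a new key: remember the key and emit the row.
            last_key = Some(key.clone());
            true
        }
    })
}
```

   Because the result is a lazy iterator, rows stream through one at a time. 
If the input is not actually sorted on the key, this sketch silently lets 
non-adjacent duplicates through, which is why the sort order has to be known 
up front.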
   
   The biggest challenge is having the data sorted by the `DISTINCT ON` keys, 
as (re)sorting large datasets is itself resource intensive (memory, CPU, and 
potentially IO).


