alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3191783825
Indeed: if you already know the data is sorted on the `DISTINCT ON` keys, you can deduplicate in a single pass through the data with minimal memory requirements, following the suggestion from @connec.

This is the exact strategy we use in InfluxDB IOx to deduplicate (which is how we implement insert order resolution). The code for it is here: https://github.com/influxdata/influxdb3_core/blob/main/iox_query/src/provider/deduplicate.rs

The biggest challenge is having the data sorted by the `DISTINCT ON` keys, as (re)sorting large datasets is also resource intensive (memory, CPU, and potentially IO).
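
To illustrate the single-pass idea, here is a minimal sketch in plain Rust. It is not the IOx or DataFusion code: the function name `distinct_on_sorted` and the closure `key_of` are hypothetical, rows are ordinary values rather than Arrow record batches, and it assumes the input is already sorted (or at least grouped) by the key. It keeps the first row of each run of equal keys, so the only state carried between rows is the previous key.

```rust
/// Streaming deduplication over input that is already sorted by key.
///
/// Hypothetical helper for illustration only: yields the first row of
/// every run of equal keys and drops the rest. Memory use is one stored
/// key, independent of the input size.
fn distinct_on_sorted<T, K, I, F>(rows: I, mut key_of: F) -> impl Iterator<Item = T>
where
    I: IntoIterator<Item = T>,
    K: PartialEq,
    F: FnMut(&T) -> K,
{
    // Key of the most recently emitted row; `None` before the first row.
    let mut last_key: Option<K> = None;

    rows.into_iter().filter(move |row| {
        let key = key_of(row);
        // A row starts a new group when its key differs from the previous one.
        let is_new_group = last_key.as_ref() != Some(&key);
        if is_new_group {
            last_key = Some(key);
        }
        is_new_group
    })
}
```

If the input is not sorted on the key, equal keys that are not adjacent are treated as separate groups and duplicates survive, which is why the sort order is a precondition rather than something this pass can check cheaply.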