alamb commented on issue #17169:
URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3191783825

   Indeed -- if you already know the data is sorted on the `DISTINCT ON` keys, 
you can deduplicate in a single pass through the data with minimal memory 
requirements, following the suggestion from @connec.
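
A minimal sketch of that single-pass idea, in plain Rust over an iterator of rows. This is not the DataFusion or IOx implementation; `dedup_sorted` and `key_of` are hypothetical names used only for illustration. It assumes the input is already sorted (or at least grouped) by the `DISTINCT ON` keys and that the first row of each group is the one to keep:

```rust
/// Streams `rows`, yielding only the first row of each run of equal keys.
///
/// Assumption: `rows` is already sorted (or grouped) by the key that
/// `key_of` extracts. The only state kept between rows is the previous
/// key, so memory use does not grow with the input size.
fn dedup_sorted<T, K: PartialEq>(
    rows: impl Iterator<Item = T>,
    key_of: impl Fn(&T) -> K,
) -> impl Iterator<Item = T> {
    // Key of the most recently emitted row; `None` before the first row.
    let mut last_key: Option<K> = None;
    rows.filter(move |row| {
        let key = key_of(row);
        if last_key.as_ref() == Some(&key) {
            // Same key as the previous row: a duplicate, drop it.
            false
        } else {
            // A new key starts here: remember it and keep the row.
            last_key = Some(key);
            true
        }
    })
}
```

If the input is not actually sorted on the keys, this only removes adjacent duplicates, which is why the sortedness guarantee matters.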
   
   This is the exact strategy we use in InfluxDB IOx to deduplicate (which is 
how we implement insert order resolution).
   
   The code for it is here:
   - https://github.com/influxdata/influxdb3_core/blob/main/iox_query/src/provider/deduplicate.rs
   
   The biggest challenge is having the data sorted by the `DISTINCT ON` keys, 
as (re)sorting large datasets is itself resource intensive (memory, CPU, and 
potentially IO).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

