[I] Add bounded distinct count optimization [datafusion]

via GitHub Thu, 19 Mar 2026 05:42:04 -0700


brancz opened a new issue, #21051:
URL: https://github.com/apache/datafusion/issues/21051


   ### Is your feature request related to a problem or challenge?
   
   Optimize queries like
   
   ```sql
   SELECT count(DISTINCT col) > 1 FROM table;
   ```
   
   This matters when `col` has high cardinality. Currently the distinct 
accumulator collects every distinct value first and only counts them at the 
end, so we spend CPU inserting values that cannot change the result and we pay 
for memory unnecessarily.
   
   ### Describe the solution you'd like
   
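   Add a bound to count distinct so that, when the count is compared against a 
constant N, only up to N+1 values are inserted into the tracked values 
collection. For the query above N is 1, so tracking two distinct values is 
enough to decide the result.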
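   
   To illustrate the idea, here is a minimal standalone sketch of a bounded 
distinct set. It is not the prototype mentioned below and does not use any 
DataFusion API: `BoundedDistinct`, `update`, `merge`, and `exceeds_bound` are 
hypothetical names, and a real accumulator would also need to skip NULLs and 
work on Arrow arrays.
   
   ```rust
   use std::collections::HashSet;
   use std::hash::Hash;

   /// Hypothetical helper (not a DataFusion type): tracks at most N + 1
   /// distinct values, which is enough to answer `count(DISTINCT col) > N`.
   struct BoundedDistinct<T> {
       seen: HashSet<T>,
       limit: usize, // N + 1
   }

   impl<T: Eq + Hash> BoundedDistinct<T> {
       fn new(n: usize) -> Self {
           Self { seen: HashSet::new(), limit: n.saturating_add(1) }
       }

       /// True once N + 1 distinct values have been seen.
       fn exceeds_bound(&self) -> bool {
           self.seen.len() >= self.limit
       }

       /// Number of values currently held in memory (never above N + 1).
       fn tracked(&self) -> usize {
           self.seen.len()
       }

       /// Insert a value unless the bound was already reached; after that
       /// point further input cannot change the answer, so it is ignored.
       fn update(&mut self, value: T) {
           if !self.exceeds_bound() {
               self.seen.insert(value);
           }
       }

       /// Combine a partial state from another partition (assumed design):
       /// the union is truncated at the same limit by `update`.
       fn merge(&mut self, other: BoundedDistinct<T>) {
           for value in other.seen {
               self.update(value);
           }
       }
   }

   fn main() {
       // Equivalent of `count(DISTINCT col) > 1` over a column with 1000
       // distinct values: only two of them are ever stored.
       let mut acc = BoundedDistinct::new(1);
       for v in (0..1_000_000u64).map(|i| i % 1000) {
           acc.update(v);
       }
       println!("exceeds bound: {}, tracked: {}", acc.exceeds_bound(), acc.tracked());
   }
   ```
   
   In this sketch the set stops growing once it holds N+1 entries, so memory is 
bounded by N+1 rather than by the column's cardinality. A further step, not 
shown here, would be to stop scanning input entirely once the bound is reached.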
   
   ### Describe alternatives you've considered
   
   n/a
   
   ### Additional context
   
   I have a prototype of this working. If the general idea sounds good, I can 
clean it up and submit a PR.
   
   @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


