brancz opened a new issue, #21051: URL: https://github.com/apache/datafusion/issues/21051
### Is your feature request related to a problem or challenge?

Optimize queries like

```sql
SELECT count(DISTINCT col) > 1 FROM table;
```

for cases where `col` has high cardinality. The issue is that the distinct accumulator currently collects all values first and only then counts them, so we keep inserting values into the accumulator long after the answer is known, leaving CPU on the table, and we pay for memory unnecessarily.

### Describe the solution you'd like

Add a bound to count distinct and only insert into the tracked values collection up to N+1 distinct values, where N is the constant the count is compared against (N = 1 in the query above). Once N+1 distinct values have been seen, the comparison is already decided.

### Describe alternatives you've considered

n/a

### Additional context

I have a prototype of this working. If the general idea sounds good, I can clean it up and submit the PR.

@alamb
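To make the proposal concrete, here is a minimal standalone sketch of the bounded idea. It is not the prototype mentioned above and does not use DataFusion's accumulator API; `BoundedDistinct` and its methods are hypothetical names chosen for illustration, built only on the standard library's `HashSet`.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical bounded distinct tracker (illustrative, not a DataFusion type).
/// It answers `count(DISTINCT col) > n` while storing at most n + 1 values.
struct BoundedDistinct<T> {
    seen: HashSet<T>,
    /// Number of distinct values after which the answer is decided (n + 1).
    limit: usize,
}

impl<T: Eq + Hash> BoundedDistinct<T> {
    /// `n` is the constant in `count(DISTINCT col) > n`.
    fn new(n: usize) -> Self {
        Self {
            seen: HashSet::new(),
            limit: n.saturating_add(1),
        }
    }

    /// True once n + 1 distinct values have been tracked.
    fn exceeds_bound(&self) -> bool {
        self.seen.len() >= self.limit
    }

    /// Track a value, skipping the insert entirely once the bound is reached.
    fn update(&mut self, value: T) {
        if !self.exceeds_bound() {
            self.seen.insert(value);
        }
    }

    /// Number of values currently held in memory (never more than n + 1).
    fn tracked(&self) -> usize {
        self.seen.len()
    }
}
```

With `BoundedDistinct::new(1)`, feeding a high-cardinality column stops inserting after the second distinct value, so both the hashing work and the memory stay constant regardless of input size. A real accumulator would presumably also need to apply the same cap when merging partial states; that part is not shown here.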
