jayzhan211 opened a new issue, #15850:
URL: https://github.com/apache/datafusion/issues/15850

   ### Is your feature request related to a problem or challenge?
   
   ```
   statement count 0
   create table t(a int) as values (1), (2);
   
   query I
   select count(distinct a) from t; 
   ----
   2
   
   query TT
   explain
   select count(distinct a) from t; 
   ----
   logical_plan
   01)Projection: count(alias1) AS count(DISTINCT t.a)
   02)--Aggregate: groupBy=[[]], aggr=[[count(alias1)]]
   03)----Aggregate: groupBy=[[t.a AS alias1]], aggr=[[]]
   04)------TableScan: t projection=[a]
   physical_plan
   01)ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.a)]
   02)--AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
   03)----CoalescePartitionsExec
   04)------AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
   05)--------AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]
   06)----------CoalesceBatchesExec: target_batch_size=8192
   07)------------RepartitionExec: partitioning=Hash([alias1@0], 4), input_partitions=4
   08)--------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
   09)----------------AggregateExec: mode=Partial, gby=[a@0 as alias1], aggr=[]
   10)------------------DataSourceExec: partitions=1, partition_sizes=[1]
   
   ```
   
   I think we should execute this query with a specialized count distinct accumulator such as `PrimitiveDistinctCountAccumulator`, `BytesDistinctCountAccumulator`, or `FloatDistinctCountAccumulator`. The current execution path (two nested aggregations with a repartition in between) looks quite complex and is probably not well optimized.
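   
   To illustrate the idea, here is a rough sketch of what such an accumulator does conceptually. It is a simplified stand-in using only the Rust standard library, not DataFusion's actual implementation; `DistinctCountSketch` and its method names are hypothetical and only loosely modeled on an accumulator's update/merge/evaluate steps.
   
   ```rust
   use std::collections::HashSet;

   /// Hypothetical, simplified stand-in for a specialized distinct-count
   /// accumulator over i32 values. Not DataFusion's real code.
   #[derive(Default)]
   struct DistinctCountSketch {
       seen: HashSet<i32>,
   }

   impl DistinctCountSketch {
       /// Fold one batch of nullable values into the set. NULLs (None) are
       /// skipped, matching SQL COUNT(DISTINCT ...) semantics.
       fn update_batch(&mut self, values: &[Option<i32>]) {
           self.seen.extend(values.iter().flatten().copied());
       }

       /// Merge the state produced by another partition's accumulator.
       fn merge(&mut self, other: DistinctCountSketch) {
           self.seen.extend(other.seen);
       }

       /// Final result: the number of distinct non-null values seen.
       fn evaluate(&self) -> i64 {
           self.seen.len() as i64
       }
   }
   ```
   
   With this shape, each partition keeps one set, the partial sets are merged once, and the result is the size of the merged set, so no separate group-by stage is needed to deduplicate the values.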
   
   ### Describe the solution you'd like
   
   Investigate why the distinct count accumulator is not called for this query, and whether switching to it simplifies the plan or improves performance.
   
   ClickBench includes queries that use `count(distinct ...)`, so we could benchmark against it to see whether the change actually helps.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
