[I] Support Count(Distinct) (and similar) aggregation [arrow-datafusion-comet]

via GitHub Fri, 16 Feb 2024 09:25:39 -0800


sunchao opened a new issue, #38:
URL: https://github.com/apache/arrow-datafusion-comet/issues/38


   ### What is the problem the feature request solves?
   
   We should also support distinct aggregations such as `COUNT(DISTINCT col)`. In Spark,
   
   ```sql
   SELECT COUNT(DISTINCT(_1)) FROM tbl
   ```
   
   produces a plan like the following:
   ```
   AdaptiveSparkPlan isFinalPlan=false
   +- HashAggregate(keys=[], functions=[count(distinct _1#9)], output=[count(DISTINCT _1)#16L])
      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=40]
         +- HashAggregate(keys=[], functions=[partial_count(distinct _1#9)], output=[count#20L])
            +- HashAggregate(keys=[_1#9], functions=[], output=[_1#9])
               +- Exchange hashpartitioning(_1#9, 5), ENSURE_REQUIREMENTS, [plan_id=37]
                  +- HashAggregate(keys=[_1#9], functions=[], output=[_1#9])
                     +- Scan parquet [_1#9] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>
   ```
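
Read bottom-up, the plan deduplicates `_1` within each partition, shuffles by `_1` so that equal values meet in one partition, deduplicates again, counts the non-NULL values per partition, and finally sums those partial counts in a single partition. The following is a rough pure-Python sketch of that staging, for illustration only. It is not Spark or Comet code, and the function names (`dedupe`, `hash_exchange`, `partial_count`, `count_distinct`) and the sample data are hypothetical.

```python
def dedupe(rows):
    """Models HashAggregate(keys=[_1], functions=[]): one row per distinct key."""
    # dict.fromkeys keeps the first occurrence of each key, including None (NULL).
    return list(dict.fromkeys(rows))


def hash_exchange(partitions, num_partitions):
    """Models Exchange hashpartitioning(_1, n): equal keys land in the same partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for value in part:
            out[hash(value) % num_partitions].append(value)
    return out


def partial_count(rows):
    """Models partial_count(distinct _1): count the non-NULL values in one partition."""
    return sum(1 for value in rows if value is not None)


def count_distinct(partitions, num_partitions=5):
    """Runs the stages in the same order as the plan, bottom-up."""
    locally_deduped = [dedupe(p) for p in partitions]         # first HashAggregate
    shuffled = hash_exchange(locally_deduped, num_partitions)  # hash exchange
    globally_deduped = [dedupe(p) for p in shuffled]           # second HashAggregate
    partials = [partial_count(p) for p in globally_deduped]    # partial count
    return sum(partials)                                       # single partition + final count


# Hypothetical input: three scan partitions of column _1, with NULLs as None.
input_partitions = [[1, 2, 2, None], [2, 3, None], [3, 3, 4]]
result = count_distinct(input_partitions)
```

Because the second deduplication runs after the hash exchange, each distinct value survives in exactly one partition, so the per-partition counts can simply be added. NULLs form their own group in the deduplication steps but are skipped by the count, matching the SQL semantics of `COUNT(DISTINCT ...)`.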
   
   ### Describe the potential solution
   
   Add support for `COUNT(DISTINCT)` (and similar aggregations) so that the Spark physical plan can be converted to a native plan and executed.
   
   ### Additional context
   
   _No response_


