gruuya opened a new issue, #9943:
URL: https://github.com/apache/arrow-datafusion/issues/9943

   ### Describe the bug
   
   As per the standard `COUNT(1)` is equivalent to the more common `COUNT(*)`, 
however this is not the case now, since the latter has the following sentinel 
value:
   
https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/expr/src/utils.rs#L43-L45
   whereas the former resolves to `COUNT(Int64(1))`.
   
   In turn this means this optimization is not being used
   
https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L158-L162
   
   ### To Reproduce
   
   ```bash
   gruuya@markogrujics-MacBook-Pro datafusion-cli % 
DATAFUSION_EXECUTION_COLLECT_STATISTICS=true datafusion-cli
   DataFusion CLI v36.0.0
   ❯ create external table hits stored as parquet location 
'benchmarks/data/hits.parquet';
   0 rows in set. Query took 0.030 seconds.
   
   ❯ explain select count(*) from hits;
   
+---------------+---------------------------------------------------------------+
   | plan_type     | plan                                                       
   |
   
+---------------+---------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS 
COUNT(*)]] |
   |               |   TableScan: hits projection=[]                            
   |
   | physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)]                
   |
   |               |   PlaceholderRowExec                                       
   |
   |               |                                                            
   |
   
+---------------+---------------------------------------------------------------+
   2 rows in set. Query took 0.039 seconds.
   
   ❯ explain select count(1) from hits;
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]]          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   |               |   TableScan: hits projection=[]                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[COUNT(Int64(1))]  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   |               |   CoalescePartitionsExec                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   |               |     AggregateExec: mode=Partial, gby=[], 
aggr=[COUNT(Int64(1))]                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
|
   |               |       ParquetExec: file_groups={12 groups: 
[[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:0..1231664704],
 
[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:1231664704..2463329408],
 
[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:2463329408..3694994112],
 
[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:3694994112..4926658816],
 
[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:4926658816..6158323520],
 ...]} |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                              |
   
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.011 seconds.
   ```
   
   Given that `count(*)` basically just prints out the metadata, while 
`count(1)` does some additional work, there's also a an apparent performance 
implication, albeit negligible, with `count(1)` being 10x slower:
   ```
   ❯ select count(*) from hits;
   +----------+
   | COUNT(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row in set. Query took 0.004 seconds.
   
   ❯ select count(1) from hits;
   +-----------------+
   | COUNT(Int64(1)) |
   +-----------------+
   | 99997497        |
   +-----------------+
   1 row in set. Query took 0.043 seconds.
   ```
   
   ### Expected behavior
   
   `COUNT(1)` and `COUNT(*)` should be interchangable and result in the same 
logical and physical plans (save for the column alias).
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to