matthewmturner commented on a change in pull request #1063:
URL: https://github.com/apache/arrow-datafusion/pull/1063#discussion_r747044734
##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -153,6 +159,39 @@ fn take_optimizable_count(
None
}
+/// If this agg_expr is a count that can be derived from the statistics,
return it
+fn take_optimizable_count_with_nulls(
+ agg_expr: &dyn AggregateExpr,
+ stats: &Statistics,
+) -> Option<(ScalarValue, String)> {
+ if let (Some(num_rows), Some(col_stats), Some(casted_expr)) = (
+ stats.num_rows,
+ &stats.column_statistics,
+ agg_expr.as_any().downcast_ref::<expressions::Count>(),
+ ) {
+ if casted_expr.expressions().len() == 1 {
Review comment:
My understanding is that `COUNT(*)` doesnt need to have a separate
handler for nulls - assuming we expect same behavior as psql. For example in
psql when i do the following:
```
postgres=# create table foo as select * from (values
(1,NULL),(NULL,2),(3,3)) as sq;
SELECT 3
postgres=# select * from foo;
column1 | column2
---------+---------
1 |
| 2
3 | 3
(3 rows)
postgres=# select count(*) from foo;
count
-------
3
(1 row)
```
Does it make sense to reframe these optimizations as the following:
`take_optimizable_table_count` (current `take_optimizable_count`)=> comes
from `COUNT(*)` and returns `num_rows`
`take_optimizable_column_count` (current
`take_optimizable_count_with_nulls`) => comes from `COUNT(col)` and return
`num_rows - null_count` for `col`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]