[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1063: Optimize count agg expr with null column statistics

GitBox Wed, 10 Nov 2021 14:48:33 -0800


matthewmturner commented on a change in pull request #1063:
URL: https://github.com/apache/arrow-datafusion/pull/1063#discussion_r747044734




##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -153,6 +159,39 @@ fn take_optimizable_count(
     None
 }
 
+/// If this agg_expr is a count that can be derived from the statistics, 
return it
+fn take_optimizable_count_with_nulls(
+    agg_expr: &dyn AggregateExpr,
+    stats: &Statistics,
+) -> Option<(ScalarValue, String)> {
+    if let (Some(num_rows), Some(col_stats), Some(casted_expr)) = (
+        stats.num_rows,
+        &stats.column_statistics,
+        agg_expr.as_any().downcast_ref::<expressions::Count>(),
+    ) {
+        if casted_expr.expressions().len() == 1 {

Review comment:
       My understanding is that `COUNT(*)` doesnt need to have a separate 
handler for nulls - assuming we expect same behavior as psql.  For example in 
psql when i do the following:
   ```
   postgres=# create table foo as select * from (values 
(1,NULL),(NULL,2),(3,3)) as sq;
   SELECT 3
   postgres=# select * from foo;
    column1 | column2
   ---------+---------
          1 |
            |       2
          3 |       3
   (3 rows)
   
   postgres=# select count(*) from foo;
    count
   -------
        3
   (1 row)
   ```
   Does it make sense to reframe these optimizations as the following:
   `take_optimizable_table_count` (current `take_optimizable_count`)=> comes 
from `COUNT(*)` and returns `num_rows`
   `take_optimizable_column_count` (current 
`take_optimizable_count_with_nulls`) => comes from `COUNT(col)` and return 
`num_rows - null_count` for `col`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] matthewmturner commented on a change in pull request #1063: Optimize count agg expr with null column statistics

Reply via email to