[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1063: Optimize count agg expr with null column statistics

GitBox Wed, 10 Nov 2021 07:01:32 -0800


alamb commented on a change in pull request #1063:
URL: https://github.com/apache/arrow-datafusion/pull/1063#discussion_r746670550




##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -250,38 +289,41 @@ mod tests {
     }
 
     /// Checks that the count optimization was applied and we still get the 
right result
-    async fn assert_count_optim_success(plan: HashAggregateExec) -> Result<()> 
{
+    async fn assert_count_optim_success(
+        plan: HashAggregateExec,
+        nulls: bool,
+    ) -> Result<()> {
         let conf = ExecutionConfig::new();
         let optimized = AggregateStatistics::new().optimize(Arc::new(plan), 
&conf)?;
 
+        let (col, count) = match nulls {
+            false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false), 
3),

Review comment:
       I don't understand why the column name output is different for columns 
with NULLs (and columns that don't have nulls)
   
   I think the difference is if the aggregate is `COUNT(*)` --> 
`COUNT(UInt8(1))` and `COUNT(col)` --> `COUNT(col)`
   
   This can be seen in the datafusion-cli on master
   
   ```
   ❯ create table foo as select * from (values (1, NULL), (2, 2), (3,3)) as sq;
   0 rows in set. Query took 0.007 seconds.
   
   ❯ select * from foo;
   +---------+---------+
   | column1 | column2 |
   +---------+---------+
   | 1       |         |
   | 2       | 2       |
   | 3       | 3       |
   +---------+---------+
   3 rows in set. Query took 0.003 seconds.
   ❯ select count(column1) from foo;
   +--------------------+
   | COUNT(foo.column1) |
   +--------------------+
   | 3                  |
   +--------------------+
   1 row in set. Query took 0.004 seconds.
   ❯ select count(column2) from foo;
   +--------------------+
   | COUNT(foo.column2) |
   +--------------------+
   | 2                  |
   +--------------------+
   1 row in set. Query took 0.004 seconds.
   ❯ select count(*) from foo;
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 3               |
   +-----------------+
   1 row in set. Query took 0.002 seconds.
   ```
   
   

##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -153,6 +159,39 @@ fn take_optimizable_count(
     None
 }
 
+/// If this agg_expr is a count that can be derived from the statistics, 
return it
+fn take_optimizable_count_with_nulls(
+    agg_expr: &dyn AggregateExpr,
+    stats: &Statistics,
+) -> Option<(ScalarValue, String)> {
+    if let (Some(num_rows), Some(col_stats), Some(casted_expr)) = (
+        stats.num_rows,
+        &stats.column_statistics,
+        agg_expr.as_any().downcast_ref::<expressions::Count>(),
+    ) {
+        if casted_expr.expressions().len() == 1 {

Review comment:
       it looks like this code handles `count(col)` whereas the code above only 
handles `count(*)` -- that seems strange -- perhaps we should update it so both 
can handle `count(col)` and `count(*)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1063: Optimize count agg expr with null column statistics

Reply via email to