alamb commented on a change in pull request #1063:
URL: https://github.com/apache/arrow-datafusion/pull/1063#discussion_r746670550
##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -250,38 +289,41 @@ mod tests {
}
/// Checks that the count optimization was applied and we still get the
right result
- async fn assert_count_optim_success(plan: HashAggregateExec) -> Result<()>
{
+ async fn assert_count_optim_success(
+ plan: HashAggregateExec,
+ nulls: bool,
+ ) -> Result<()> {
let conf = ExecutionConfig::new();
let optimized = AggregateStatistics::new().optimize(Arc::new(plan),
&conf)?;
+ let (col, count) = match nulls {
+ false => (Field::new("COUNT(Uint8(1))", DataType::UInt64, false),
3),
Review comment:
I don't understand why the column name output is different for columns
with NULLs (and columns that don't have nulls)
I think the difference is if the aggregate is `COUNT(*)` -->
`COUNT(UInt8(1))` and `COUNT(col)` --> `COUNT(col)`
This can be seen in the datafusion-cli on master
```
❯ create table foo as select * from (values (1, NULL), (2, 2), (3,3)) as sq;
0 rows in set. Query took 0.007 seconds.
❯ select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| 1 | |
| 2 | 2 |
| 3 | 3 |
+---------+---------+
3 rows in set. Query took 0.003 seconds.
❯ select count(column1) from foo;
+--------------------+
| COUNT(foo.column1) |
+--------------------+
| 3 |
+--------------------+
1 row in set. Query took 0.004 seconds.
❯ select count(column2) from foo;
+--------------------+
| COUNT(foo.column2) |
+--------------------+
| 2 |
+--------------------+
1 row in set. Query took 0.004 seconds.
❯ select count(*) from foo;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 3 |
+-----------------+
1 row in set. Query took 0.002 seconds.
```
##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -153,6 +159,39 @@ fn take_optimizable_count(
None
}
+/// If this agg_expr is a count that can be derived from the statistics,
return it
+fn take_optimizable_count_with_nulls(
+ agg_expr: &dyn AggregateExpr,
+ stats: &Statistics,
+) -> Option<(ScalarValue, String)> {
+ if let (Some(num_rows), Some(col_stats), Some(casted_expr)) = (
+ stats.num_rows,
+ &stats.column_statistics,
+ agg_expr.as_any().downcast_ref::<expressions::Count>(),
+ ) {
+ if casted_expr.expressions().len() == 1 {
Review comment:
it looks like this code handles `count(col)` whereas the code above only
handles `count(*)` -- that seems strange -- perhaps we should update it so both
can handle `count(col)` and `count(*)`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]