[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2674: Fix `AggregateStatistics` optimization so it doesn't change output type

GitBox Thu, 02 Jun 2022 12:10:14 -0700


alamb commented on code in PR #2674:
URL: https://github.com/apache/arrow-datafusion/pull/2674#discussion_r888250058



##########
datafusion/core/src/physical_optimizer/aggregate_statistics.rs:
##########
@@ -293,38 +297,80 @@ mod tests {
     /// Checks that the count optimization was applied and we still get the right result
     async fn assert_count_optim_success(plan: AggregateExec, nulls: bool) -> Result<()> {
         let session_ctx = SessionContext::new();
-        let task_ctx = session_ctx.task_ctx();
         let conf = session_ctx.copied_config();
-        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+        let plan = Arc::new(plan) as _;
+        let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?;
 
         let (col, count) = match nulls {

Review Comment:
   I believe the `nulls` flag is really controlling `COUNT(*)` vs `COUNT(col)`. I consolidated the differences in eb14658de7 into a `TestAggregate` struct, and I think it is much more understandable now.
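
   For readers following along without the commit open, the idea of replacing a bare `nulls: bool` with a named type can be sketched roughly as below. This is only an illustration: the variant names, method names, and the plain `Option<i32>` slice standing in for an Arrow column are assumptions, not the code in eb14658de7, and the `COUNT(*)` column name is copied from the string literal on the removed diff line rather than from `COUNT_STAR_NAME`.

```rust
/// Hypothetical test helper naming which flavor of COUNT a test exercises,
/// instead of passing an opaque `nulls: bool`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TestAggregate {
    /// `COUNT(*)`: counts every row, including rows where the column is NULL.
    CountStar,
    /// `COUNT(a)`: counts only the rows where column `a` is not NULL.
    ColumnA,
}

impl TestAggregate {
    /// Name of the output column for this aggregate.
    /// Assumption: the `COUNT(*)` name is taken from the old string literal
    /// in the diff; the real constant's value is not shown in this thread.
    fn column_name(&self) -> &'static str {
        match self {
            TestAggregate::CountStar => "COUNT(UInt8(1))",
            TestAggregate::ColumnA => "COUNT(a)",
        }
    }

    /// Count the test should expect for the given column values.
    /// `None` stands in for a NULL entry.
    fn expected_count(&self, column: &[Option<i32>]) -> i64 {
        match self {
            TestAggregate::CountStar => column.len() as i64,
            TestAggregate::ColumnA => {
                column.iter().filter(|v| v.is_some()).count() as i64
            }
        }
    }
}
```

   With a three-row column containing one NULL, `TestAggregate::CountStar.expected_count(..)` gives 3 and `TestAggregate::ColumnA.expected_count(..)` gives 2, matching the two arms of the `match nulls` in the diff.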



##########
datafusion/core/src/physical_optimizer/aggregate_statistics.rs:
##########
@@ -293,38 +297,80 @@ mod tests {
     /// Checks that the count optimization was applied and we still get the right result
     async fn assert_count_optim_success(plan: AggregateExec, nulls: bool) -> Result<()> {
         let session_ctx = SessionContext::new();
-        let task_ctx = session_ctx.task_ctx();
         let conf = session_ctx.copied_config();
-        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+        let plan = Arc::new(plan) as _;
+        let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?;
 
         let (col, count) = match nulls {
-            false => (Field::new("COUNT(UInt8(1))", DataType::UInt64, false), 3),
-            true => (Field::new("COUNT(a)", DataType::UInt64, false), 2),
+            false => (Field::new(COUNT_STAR_NAME, DataType::Int64, false), 3),
+            true => (Field::new("COUNT(a)", DataType::Int64, false), 2),
         };
 
         // A ProjectionExec is a sign that the count optimization was applied
         assert!(optimized.as_any().is::<ProjectionExec>());
+        let task_ctx = session_ctx.task_ctx();
         let result = common::collect(optimized.execute(0, task_ctx)?).await?;
         assert_eq!(result[0].schema(), Arc::new(Schema::new(vec![col])));
         assert_eq!(
             result[0]
                 .column(0)
                 .as_any()
-                .downcast_ref::<UInt64Array>()
+                .downcast_ref::<Int64Array>()
                 .unwrap()
                 .values(),
             &[count]
         );
+
+        // Validate that the optimized plan returns the exact same
+        // answer (both schema and data) as the original plan
+        let task_ctx = session_ctx.task_ctx();
+        let plan_result = common::collect(plan.execute(0, task_ctx)?).await?;
+        assert_eq!(normalize(result), normalize(plan_result));

Review Comment:
   I removed the normalization in 171c89901ecdadca6c2eccb2973bc7ad0990c92f, and I think it is much simpler to follow now.
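
   As a rough illustration of why a direct comparison is enough to catch the bug this PR fixes, here is a minimal sketch using stand-in types. `Column`, `Batch`, and `same_output` are hypothetical names invented for this example; the real test compares Arrow record batches, which carry their own schema.

```rust
/// Stand-in for a typed Arrow array: the variant plays the role of the data type.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    UInt64(Vec<u64>),
}

/// Stand-in for a single-column record batch: a field name plus its values.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    name: String,
    column: Column,
}

/// Compare the optimized and original results directly, with no normalization.
/// Equality covers the field name, the data type (enum variant), and the values.
fn same_output(optimized: &[Batch], original: &[Batch]) -> bool {
    optimized == original
}
```

   In this sketch, two batches holding the value 3 under the same name still compare unequal when one is `Column::UInt64` and the other is `Column::Int64`, which is the kind of output type change the test is meant to catch.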



