ashdnazg commented on PR #17726:
URL: https://github.com/apache/datafusion/pull/17726#issuecomment-3347258929

   @alamb 
   I added a boolean column to Query 10 in h2o_medium:
   ```
   SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count, (v1 % 
2)=0 AS b1 FROM x GROUP BY id1, id2, id3, id4, id5, id6, b1;
   ```
   
   And I get the following results locally
   
   Without the boolean impl:
   ```
   Query 1 iteration 1 took 8836.5 ms and returned 100000000 rows
   Query 1 iteration 2 took 8293.7 ms and returned 100000000 rows
   Query 1 iteration 3 took 7957.9 ms and returned 100000000 rows
   Query 1 avg time: 8362.69 ms
   ```
   
   With the boolean impl
   ```
   Query 1 iteration 1 took 5202.1 ms and returned 100000000 rows
   Query 1 iteration 2 took 4666.2 ms and returned 100000000 rows
   Query 1 iteration 3 took 4577.2 ms and returned 100000000 rows
   Query 1 avg time: 4815.19 ms
   ```
   
   In general the multi group by option seems to make certain scenarios worse.
   I added a flag for easy control of whether it's used in this commit: 
https://github.com/ashdnazg/datafusion/commit/6a0b13bc2a5f9305b38f9bd0fa87cbf865e9c37b
   and then checked:
   ```
   create or replace table foo as select (random() * 4)::integer as int_val, 
(random() * 4)::integer as int_val2, (random() * 4)::integer as int_val3  from 
generate_series(1, 1000000000);
   set datafusion.execution.enable_multi_group_by to false;
   select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, 
int_val3;
   set datafusion.execution.enable_multi_group_by to true;
   select int_val, int_val2, int_val3 from foo GROUP BY int_val, int_val2, 
int_val3;
   ```
   
   I get `Elapsed 1.037 seconds.` with the flag set to false and `Elapsed 1.382 
seconds.` with the flag set to true.
   
   Perhaps the logic for when to use it should be more careful than just 
"whenever it's supported". But I suspect this PR is not the right spot for that 
discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to