Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Tue, 03 Sep 2024 23:58:57 -0700


yjshen commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2328048261


   > guess the reason why performance improved may be similar to the partial skipping?
   
   I would say partial skipping contributes only a small portion of the 
improvement to Q32. To check, I raised both thresholds so that partial 
aggregation is effectively never skipped:
   
   ```
   set datafusion.execution.skip_partial_aggregation_probe_rows_threshold=10000000000;
   set datafusion.execution.skip_partial_aggregation_probe_ratio_threshold=1.0;
   explain analyze
   SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth")
   FROM hits
   WHERE "SearchPhrase" <> ''
   GROUP BY "WatchID", "ClientIP"
   ORDER BY c DESC
   LIMIT 10;
   ```
   
   I get:
   
   ```
   AggregateExec: mode=Partial, gby=[WatchID@0 as WatchID, ClientIP@1 as ClientIP], aggr=[count(*), sum(hits.IsRefresh), avg(hits.ResolutionWidth)], metrics=[output_rows=13172392, elapsed_compute=1.403636372s, skipped_aggregation_rows=0, skipped_aggregation_computation_time=16ns]
   ```
   
   After removing `explain analyze`, I get these timings for three runs:
   
   ```
   0.730
   0.535
   0.530
   ```
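
A quick way to summarize the three runs above, as a rough sketch only: the comment does not state the unit, and picking the minimum and median as summary statistics is an assumption about what is useful here, not something the comment prescribes.

```python
from statistics import mean, median

# Timings copied from the three runs above (unit not stated in the comment).
timings = [0.730, 0.535, 0.530]

fastest = min(timings)      # best observed run
typical = median(timings)   # less sensitive to one slow run than the mean
average = mean(timings)

# How much slower the first run was than the fastest one.
first_vs_fastest = timings[0] / fastest

print(f"min={fastest:.3f} median={typical:.3f} mean={average:.3f}")
print(f"first run / fastest run = {first_vs_fastest:.2f}x")
```

The first run is noticeably slower than the other two; the comment does not say why, so the median or minimum is probably the safer figure to compare against other engines.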
   
   Since DataFusion 40.0.0 takes `x2.15` the time of DuckDB on this query, 
partial skipping may not be essential for Q32.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

