On Thu, Apr 9, 2020 at 1:02 PM Jeff Davis <pg...@j-davis.com> wrote:
> 2. enable_groupingsets_hash_disk (default false):
>
> This is about how we choose which grouping sets to hash and which to
> sort when generating mixed mode paths.
>
> Even before this patch, there are quite a few paths that could be
> generated. It tries to estimate the size of each grouping set's hash
> table, and then see how many it can fit in work_mem (knapsack), while
> also taking advantage of any path keys, etc.
>
> With Disk-based Hash Aggregation, in principle we can generate paths
> representing any combination of hashing and sorting for the grouping
> sets. But that would be overkill (and grow to a huge number of paths if
> we have more than a handful of grouping sets). So I think the existing
> planner logic for grouping sets is fine for now. We might come up with
> a better approach later.
>
> But that created a testing problem, because if the planner estimates
> correctly, no hashed grouping sets will spill, and the spilling code
> won't be exercised. This GUC makes the planner disregard which grouping
> sets' hash tables will fit, making it much easier to exercise the
> spilling code. Is there a better way I should be testing this code
> path?

So, I was catching up on email and noticed the last email in this thread.
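
For context, a plain (non-grouping-sets) hashed aggregate can already be
made to spill just by shrinking work_mem. Something like the sketch
below should do it (illustrative only: whether the planner actually
picks a hashed plan here depends on costing):

set enable_hashagg = true;
set enable_hashagg_disk = true;
set work_mem = '64kB';

-- many more groups than fit in 64kB; once the hash table spills,
-- EXPLAIN ANALYZE should report "Peak Memory Usage", "Disk Usage",
-- and "HashAgg Batches" lines like the plans further down
explain (analyze, costs off, timing off)
select g % 100000 as k, count(*)
from generate_series(0, 199999) g
group by 1;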

I think I am not fully understanding what enable_groupingsets_hash_disk
does. Is it only for testing?

Using the tests you added to src/test/regress/sql/groupingsets.sql, I
did get a plan that looks like hashagg is spilling to disk (it goes
through the hashagg_spill_tuple() code path and reports the number of
batches in EXPLAIN) in a MixedAgg plan for a grouping sets query, even
with enable_groupingsets_hash_disk set to false. You don't have the
exact query I tried (below) in the test suite, but it is basically what
is already there, so I must be missing something.

set enable_hashagg_disk = true;
set enable_groupingsets_hash_disk = false;
set work_mem = '64kB';
set enable_hashagg = true;
set jit_above_cost = 0;

drop table if exists gs_hash_1;
create table gs_hash_1 as
    select g1000, g100, g10, sum(g::numeric), count(*), max(g::text)
    from (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
          from generate_series(0,199999) g) s
    group by cube (g1000, g100, g10);

explain (analyze, costs off, timing off)
    select g1000, g100, g10 from gs_hash_1 group by cube (g1000, g100, g10);

                          QUERY PLAN
--------------------------------------------------------------
 MixedAggregate (actual rows=9648 loops=1)
   Hash Key: g10
   Hash Key: g10, g1000
   Hash Key: g100
   Hash Key: g100, g10
   Group Key: g1000, g100, g10
   Group Key: g1000, g100
   Group Key: g1000
   Group Key: ()
   Peak Memory Usage: 233 kB
   Disk Usage: 1600 kB
   HashAgg Batches: 2333
   ->  Sort (actual rows=4211 loops=1)
         Sort Key: g1000, g100, g10
         Sort Method: external merge  Disk: 384kB
         ->  Seq Scan on gs_hash_1 (actual rows=4211 loops=1)

Anyway, when I throw in the stats trick that is used in join_hash.sql:

alter table gs_hash_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname = 'gs_hash_1';

I get a MixedAgg plan that doesn't have any Sort below it and uses much
more disk:

                        QUERY PLAN
----------------------------------------------------------
 MixedAggregate (actual rows=4211 loops=1)
   Hash Key: g1000, g100, g10
   Hash Key: g1000, g100
   Hash Key: g1000
   Hash Key: g100, g10
   Hash Key: g100
   Hash Key: g10, g1000
   Hash Key: g10
   Group Key: ()
   Peak Memory Usage: 405 kB
   Disk Usage: 59712 kB
   HashAgg Batches: 4209
   ->  Seq Scan on gs_hash_1 (actual rows=200000 loops=1)

I'm not sure if this is more what you were looking for -- or maybe I am
misunderstanding the GUC.

-- 
Melanie Plageman