On Thu, Apr 9, 2020 at 1:02 PM Jeff Davis <pg...@j-davis.com> wrote:
> 2. enable_groupingsets_hash_disk (default false):
>
> This is about how we choose which grouping sets to hash and which to
> sort when generating mixed mode paths.
>
> Even before this patch, there are quite a few paths that could be
> generated. It tries to estimate the size of each grouping set's hash
> table, and then see how many it can fit in work_mem (knapsack), while
> also taking advantage of any path keys, etc.
>
> With Disk-based Hash Aggregation, in principle we can generate paths
> representing any combination of hashing and sorting for the grouping
> sets. But that would be overkill (and grow to a huge number of paths if
> we have more than a handful of grouping sets). So I think the existing
> planner logic for grouping sets is fine for now. We might come up with
> a better approach later.
>
> But that created a testing problem, because if the planner estimates
> correctly, no hashed grouping sets will spill, and the spilling code
> won't be exercised. This GUC makes the planner disregard which grouping
> sets' hash tables will fit, making it much easier to exercise the
> spilling code. Is there a better way I should be testing this code
> path?

So, I was catching up on email and noticed the last email in this thread.
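
For context, a plain (non-grouping-sets) hashed aggregate can already be
made to spill just by shrinking work_mem. Something like the sketch
below should do it (illustrative only: whether the planner actually
picks a hashed plan here depends on costing):

set enable_hashagg = true;
set enable_hashagg_disk = true;
set work_mem = '64kB';

-- many more groups than fit in 64kB; once the hash table spills,
-- EXPLAIN ANALYZE should report "Peak Memory Usage", "Disk Usage",
-- and "HashAgg Batches" lines like the plans further down
explain (analyze, costs off, timing off)
select g % 100000 as k, count(*)
from generate_series(0, 199999) g
group by 1;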

I think I am not fully understanding what enable_groupingsets_hash_disk
does. Is it only for testing?

Using the tests you added to src/test/regress/sql/groupingsets.sql, I
did get a plan that looks like hashagg is spilling to disk (it goes
through the hashagg_spill_tuple() code path and reports the number of
batches in EXPLAIN) in a MixedAgg plan for a grouping sets query, even
with enable_groupingsets_hash_disk set to false. You don't have the
exact query I tried (below) in the test suite, but it is basically what
is already there, so I must be missing something.

set enable_hashagg_disk = true;
set enable_groupingsets_hash_disk = false;
set work_mem = '64kB';
set enable_hashagg = true;
set jit_above_cost = 0;

drop table if exists gs_hash_1;
create table gs_hash_1 as
    select g1000, g100, g10, sum(g::numeric), count(*), max(g::text)
    from (select g%1000 as g1000, g%100 as g100, g%10 as g10, g
          from generate_series(0,199999) g) s
    group by cube (g1000, g100, g10);

explain (analyze, costs off, timing off)
    select g1000, g100, g10 from gs_hash_1 group by cube (g1000, g100, g10);

                          QUERY PLAN
--------------------------------------------------------------
 MixedAggregate (actual rows=9648 loops=1)
   Hash Key: g10
   Hash Key: g10, g1000
   Hash Key: g100
   Hash Key: g100, g10
   Group Key: g1000, g100, g10
   Group Key: g1000, g100
   Group Key: g1000
   Group Key: ()
   Peak Memory Usage: 233 kB
   Disk Usage: 1600 kB
   HashAgg Batches: 2333
   ->  Sort (actual rows=4211 loops=1)
         Sort Key: g1000, g100, g10
         Sort Method: external merge  Disk: 384kB
         ->  Seq Scan on gs_hash_1 (actual rows=4211 loops=1)

Anyway, when I throw in the stats trick that is used in join_hash.sql:

alter table gs_hash_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname = 'gs_hash_1';

I get a MixedAgg plan that doesn't have any Sort below it and uses much
more disk:

                        QUERY PLAN
----------------------------------------------------------
 MixedAggregate (actual rows=4211 loops=1)
   Hash Key: g1000, g100, g10
   Hash Key: g1000, g100
   Hash Key: g1000
   Hash Key: g100, g10
   Hash Key: g100
   Hash Key: g10, g1000
   Hash Key: g10
   Group Key: ()
   Peak Memory Usage: 405 kB
   Disk Usage: 59712 kB
   HashAgg Batches: 4209
   ->  Seq Scan on gs_hash_1 (actual rows=200000 loops=1)

I'm not sure if this is more what you were looking for -- or maybe I am
misunderstanding the GUC.

-- 
Melanie Plageman