On 10.8.2014 23:26, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something similar,
> to track memory usage.
> 
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will contain itself to work_mem even if the planner makes a
> bad misestimate of the cardinality.
> 
> This is a well-known concept; there's even a Berkeley homework
> assignment floating around to implement it -- in postgres 7.2, no
> less. I didn't take the exact same approach as the homework assignment
> suggests, but it's not much different, either. My apologies if some
> classes are still using this as a homework assignment, but postgres
> needs to eventually have an answer to this problem.
> 
> Included is a GUC, "enable_hashagg_disk" (default on), which allows
> the planner to choose hashagg even if it doesn't expect the hashtable
> to fit in memory. If it's off, and the planner misestimates the
> cardinality, hashagg will still use the disk to contain itself to
> work_mem.
> 
> One situation that might surprise the user is if work_mem is set too
> low, and the user is *relying* on a misestimate to pick hashagg. With
> this patch, it would end up going to disk, which might be
> significantly slower. The solution for the user is to increase
> work_mem.
> 
> Rough Design:
> 
> Change the hash aggregate algorithm to accept a generic "work item",
> which consists of an input file as well as some other bookkeeping
> information.
> 
> Initially prime the algorithm by adding a single work item where the
> file is NULL, indicating that it should read from the outer plan.
> 
> If the memory is exhausted during execution of a work item, then
> continue to allow existing groups to be aggregated, but do not allow new
> groups to be created in the hash table. Tuples representing new groups
> are saved in an output partition file referenced in the work item that
> is currently being executed.
> 
> When the work item is done, emit any groups in the hash table, clear the
> hash table, and turn each output partition file into a new work item.
> 
> Each time through at least some groups are able to stay in the hash
> table, so eventually none will need to be saved in output partitions, no
> new work items will be created, and the algorithm will terminate. This
> is true even if the number of output partitions is always one.
> 
> Open items:
>    * costing
>    * EXPLAIN details for disk usage
>    * choose number of partitions intelligently
>    * performance testing
> 
> Initial tests indicate that it can be competitive with sort+groupagg
> when the disk is involved, but more testing is required.
> 
> Feedback welcome.
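
For reference while reading the design above, here is a minimal
standalone sketch of the described work-item loop. Everything in it
(WorkItem, MAX_GROUPS, NUM_PARTITIONS, next_key) is a hypothetical
illustration, not code from the patch: groups are plain int keys, the
only aggregate is count(*), and the "spill files" are tmpfile() streams.

/*
 * Toy version of the work-item loop: prime with a NULL input (the outer
 * plan); when the "hash table" is full, spill tuples for new groups to
 * output partitions, then turn each partition into a new work item.
 */
#include <stdio.h>

#define MAX_GROUPS     4        /* stand-in for the work_mem limit */
#define NUM_PARTITIONS 2

typedef struct WorkItem
{
    FILE   *input;              /* NULL means "read from the outer plan" */
} WorkItem;

/* toy "outer plan": a fixed list of group keys */
static const int outer_keys[] = {1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 5, 8};
static const int n_outer = sizeof(outer_keys) / sizeof(outer_keys[0]);

static int
next_key(WorkItem *item, int *pos, int *key)
{
    if (item->input == NULL)
    {
        if (*pos >= n_outer)
            return 0;
        *key = outer_keys[(*pos)++];
        return 1;
    }
    return fread(key, sizeof(int), 1, item->input) == 1;
}

int
main(void)
{
    WorkItem    queue[64];
    int         nqueue = 0;

    queue[nqueue++] = (WorkItem) {NULL};    /* prime with the outer plan */

    while (nqueue > 0)
    {
        WorkItem    item = queue[--nqueue];
        int         keys[MAX_GROUPS], counts[MAX_GROUPS], ngroups = 0;
        FILE       *parts[NUM_PARTITIONS] = {NULL, NULL};
        int         pos = 0, key, i, p;

        while (next_key(&item, &pos, &key))
        {
            for (i = 0; i < ngroups; i++)
                if (keys[i] == key)
                    break;

            if (i < ngroups)
                counts[i]++;                /* existing group: aggregate */
            else if (ngroups < MAX_GROUPS)
            {
                keys[ngroups] = key;        /* room left: new group */
                counts[ngroups++] = 1;
            }
            else
            {
                /* table full: save the tuple in an output partition */
                p = key % NUM_PARTITIONS;
                if (parts[p] == NULL)
                    parts[p] = tmpfile();
                fwrite(&key, sizeof(int), 1, parts[p]);
            }
        }

        for (i = 0; i < ngroups; i++)       /* emit the groups we kept */
            printf("key=%d count=%d\n", keys[i], counts[i]);

        /* each output partition becomes a new work item */
        for (p = 0; p < NUM_PARTITIONS; p++)
            if (parts[p] != NULL)
            {
                rewind(parts[p]);
                queue[nqueue++] = (WorkItem) {parts[p]};
            }

        if (item.input != NULL)
            fclose(item.input);
    }
    return 0;
}

Each pass keeps at least MAX_GROUPS groups in memory, so the queue of
work items eventually drains and the loop terminates.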

I've been working on this for a few hours - getting familiar with the
code, testing queries etc. Two comments.

1) Apparently there's something broken, because with this:

   create table table_b (fk_id int, val_a int, val_b int);
   insert into table_b
      select i, mod(i,1000), mod(i,1000)
        from generate_series(1,10000000) s(i);
   analyze table_b;

   I get this:

   set work_mem = '8MB';
   explain analyze select fk_id, count(*)
           from table_b where val_a < 50 and val_b < 50 group by 1;
   > The connection to the server was lost. Attempting reset: Failed.

   Stacktrace attached, but apparently there's a segfault in
   advance_transition_function when accessing pergroupstate.

   This happened for all queries that I tried, once they needed to do
   the batching.

2) Using the same hash value both for dynahash and batching seems
   really fishy to me. I'm not familiar with dynahash, but I'd bet
   the way it's done now will lead to bad distribution in the hash
   table (some buckets will always be empty in some batches, etc.).

   This is why hashjoin tries so hard to use non-overlapping parts
   of the hash for batchno/bucketno (see the sketch below).

   Hashjoin implements its own hash table, which makes it clear how
   the bucket is derived from the hash value. I'm not sure how
   dynahash does that, but I'm pretty sure we can't just reuse the
   hash value like this.

   I see two options - compute our own hash value, or somehow derive
   a new one (e.g. by doing "hashvalue XOR random_seed"). I'm not sure
   the latter would work, though.
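
   To illustrate the non-overlapping-bits idea: derive the bucket from
   the low bits of the hash and the batch from the next bits, assuming
   nbuckets and nbatch are powers of two. This mirrors what hashjoin's
   ExecHashGetBucketAndBatch() in nodeHash.c does, but the snippet is a
   simplified sketch, not the actual code:

   #include <stdint.h>
   #include <stdio.h>

   /*
    * Low bits pick the bucket, the next bits pick the batch, so every
    * batch still spreads its tuples across all buckets.  Simplified
    * illustration only; see nodeHash.c for the real thing.
    */
   static void
   bucket_and_batch(uint32_t hashvalue,
                    uint32_t nbuckets, int log2_nbuckets, uint32_t nbatch,
                    uint32_t *bucketno, uint32_t *batchno)
   {
       *bucketno = hashvalue & (nbuckets - 1);
       *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
   }

   int
   main(void)
   {
       uint32_t    bucketno, batchno;

       bucket_and_batch(0xdeadbeef, 1024, 10, 4, &bucketno, &batchno);
       printf("bucket = %u, batch = %u\n", bucketno, batchno);
       return 0;
   }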

regards
Tomas
Program received signal SIGSEGV, Segmentation fault.
0x000000000064e6fc in advance_transition_function (aggstate=0x11405c0, peraggstate=0x1143540, pergroupstate=0x8) at nodeAgg.c:465
465                     if (pergroupstate->noTransValue)
(gdb) bt
#0  0x000000000064e6fc in advance_transition_function (aggstate=0x11405c0, peraggstate=0x1143540, pergroupstate=0x8) at nodeAgg.c:465
#1  0x000000000064eb0e in advance_aggregates (aggstate=0x11405c0, pergroup=0x8) at nodeAgg.c:621
#2  0x00000000006502a7 in agg_fill_hash_table (aggstate=0x11405c0) at nodeAgg.c:1584
#3  0x000000000064fc3f in ExecAgg (node=0x11405c0) at nodeAgg.c:1289
#4  0x000000000063c754 in ExecProcNode (node=0x11405c0) at execProcnode.c:476
#5  0x000000000063a483 in ExecutePlan (estate=0x11404a8, planstate=0x11405c0, operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0xd487c0 <donothingDR>) at execMain.c:1475
#6  0x0000000000638675 in standard_ExecutorRun (queryDesc=0x113e658, direction=ForwardScanDirection, count=0) at execMain.c:308
#7  0x000000000063850d in ExecutorRun (queryDesc=0x113e658, direction=ForwardScanDirection, count=0) at execMain.c:256
#8  0x00000000005d7fab in ExplainOnePlan (plannedstmt=0x113e5c0, into=0x0, es=0x7fffbb7a4d50, queryString=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;", params=0x0, 
    planduration=0x7fffbb7a4cd0) at explain.c:481
#9  0x00000000005d7c98 in ExplainOneQuery (query=0x10fcc10, into=0x0, es=0x7fffbb7a4d50, queryString=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;", params=0x0) at explain.c:336
#10 0x00000000005d791c in ExplainQuery (stmt=0x10e4b50, queryString=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;", params=0x0, dest=0x10fcb40) at explain.c:227
#11 0x00000000007a0ceb in standard_ProcessUtility (parsetree=0x10e4b50, queryString=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
    dest=0x10fcb40, completionTag=0x7fffbb7a4f90 "") at utility.c:651
#12 0x00000000007a0499 in ProcessUtility (parsetree=0x10e4b50, queryString=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
    dest=0x10fcb40, completionTag=0x7fffbb7a4f90 "") at utility.c:311
#13 0x000000000079f5b2 in PortalRunUtility (portal=0x112e2d8, utilityStmt=0x10e4b50, isTopLevel=1 '\001', dest=0x10fcb40, completionTag=0x7fffbb7a4f90 "") at pquery.c:1187
#14 0x000000000079f316 in FillPortalStore (portal=0x112e2d8, isTopLevel=1 '\001') at pquery.c:1061
#15 0x000000000079ec3a in PortalRun (portal=0x112e2d8, count=9223372036854775807, isTopLevel=1 '\001', dest=0x10a5658, altdest=0x10a5658, completionTag=0x7fffbb7a5180 "") at pquery.c:785
#16 0x0000000000798d9a in exec_simple_query (query_string=0x10e3668 "explain analyze select fk_id, count(*) from table_b where val_a < 50 and val_b < 50 group by 1;") at postgres.c:1045
#17 0x000000000079cf40 in PostgresMain (argc=1, argv=0x108bd30, dbname=0x108bb98 "test", username=0x108bb78 "tomas") at postgres.c:4010
#18 0x0000000000725a48 in BackendRun (port=0x10ab1e0) at postmaster.c:4113
#19 0x0000000000725184 in BackendStartup (port=0x10ab1e0) at postmaster.c:3787
#20 0x0000000000721a51 in ServerLoop () at postmaster.c:1566
#21 0x00000000007210fe in PostmasterMain (argc=3, argv=0x108a910) at postmaster.c:1219
#22 0x0000000000683fef in main (argc=3, argv=0x108a910) at main.c:219
(gdb) 