I want customers to be able to run large OLAP queries on PostgreSQL, using as much memory as possible, to avoid spilling — without running out of memory.
There are other ways to run out of memory, but the fastest and easiest way, on an OLAP query, is to use a lot of work_mem. (This is true for any SQL database: SQL operators are “usually” streaming operators... except for those that use work_mem.) PostgreSQL already supports the work_mem GUC, and every PostgreSQL operator tries very hard to spill to disk rather than exceed its work_mem limit. For now, I’m not concerned about other ways for queries to run out of memory — just work_mem. I like the way PostgreSQL operators respect work_mem, but I can’t find a good way to set the work_mem GUC. Oracle apparently had the same problem, with their RDBMS, 20 years ago [1]: “In releases earlier than Oracle Database 10g, the database administrator controlled the maximum size of SQL work areas by setting the following parameters: SORT_AREA_SIZE, HASH_AREA_SIZE, ... Setting these parameters is difficult, because the maximum work area size is ideally selected from the data input size and the total number of work areas active in the system. These two factors vary greatly from one work area to another and from one time to another. Thus, the various *_AREA_SIZE parameters are difficult to tune under the best of circumstances. “For this reason, Oracle strongly recommends that you leave automatic PGA memory management enabled.” It’s difficult to tune PostgreSQL’s work_mem and hash_mem_multiplier GUCs, under the best of circumstances, yeah. The work_mem and hash_mem_multiplier GUCs apply to all operators of a given type, even though two operators of the same type, even in the same query, might need vastly different amounts of work_mem. I would like a “query_work_mem” GUC, similar to what’s proposed in [2]. This GUC would be useful on its own, because it would be much easier to tune than the existing work_mem + hash_mem_multiplier GUCs; and it would serve as a milestone on a path to my ultimate goal of something like Oracle’s “automatic PGA memory management.” I call it “query_work_mem,” rather than “max_total_backend_memory,” because (a) for now, I care only about limiting work_mem, I’ll deal with other types of memory separately; and (b) “query” instead of “backend” avoids ambiguity over how much memory a recursively-compiled query gets. (Re (b), see “crosstab()”, [3]. The “sql text” executed by crosstab() would get its own query_work_mem allocation, separate from the query that called the crosstab() function.) The main problem I have with the “max_total_backend_memory” proposal, however, is that it “enforces” its limit by killing the offending query. This seems an overreaction to me, especially since PostgreSQL operators already know how to spill to disk. If a customer’s OLAP query exceeds its memory limit by 1%, I would rather spill 1% of their data to disk, instead of cancelling their entire query. (And if their OLAP query exceeds its memory limit by 1,000x... I still don’t want PostgreSQL to preemptively cancel it, because either the customer ends up OK with the overall performance, even with the spilling; or else they decide the query is taking too long, and cancel it themselves. I don’t want to be in the business of preemptively cancelling customer queries.) So, I want a query_work_mem GUC, and I want PostgreSQL to distribute that total query_work_mem to the query’s individual SQL operators, so that each operator will spill rather than exceed its per-operator limit. Making query_work_mem a session GUC makes it feasible for a DBA or an automated system to distribute memory from a global memory pool among individual queries, e.g. via pg_hint_plan(). So (as mentioned above), “query_work_mem” is useful to a DBA, and also a step toward a fully-automated memory-management system. How should “query_work_mem” work? Let’s start with an example: suppose we have an OLAP query that has 2 Hash Joins, and no other operators that use work_mem. Suppose we’re pretty sure that one of the Hash Joins will use 10 KB of work_mem, while the other will use 1 GB. And suppose we know that the PostgreSQL instance has 1 GB of memory available, for use by our OLAP query. (Perhaps we reserve 1 GB for OLAP queries, and we allow only 1 OLAP query at a time; or perhaps we have some sort of dynamic memory manager.) How should we configure PostgreSQL so that our OLAP query spills as little as possible, without running out of memory? -- First, we could just say: 2 operators, total available working memory is 1 GB — give each operator 512 MB. Then we would spill 512 MB from the large Hash Join, because we’d waste around 512 MB for the small Hash Join. We’re undersubscribing, to be safe, but our performance suffers. That’s bad! We’re basically wasting memory that the query would like to use. -- Second, we could say, instead: the small Hash Join is *highly unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB, expecting that the small Hash Join won’t use more than 1 MB of its 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re oversubscribing, betting that the small Hash Join will just stay within some smaller, “unenforced” memory limit. In this example, this bet is probably fine — but it won’t work in general. I don’t want to be in the business of gambling with customer resources: if the small Hash Join is unlikely to use more than 1 MB, then let’s just assign it 1 MB of work_mem. That way, if I’m wrong, the customer’s query will just spill, instead of running out of memory. I am very strongly opposed to cancelling queries if/when we can just spill to disk. -- Third, we could just rewrite the existing “work_mem” logic so that all of the query’s operators draw, at runtime, from a single, “query_work_mem” pool. So, an operator won’t spill until all of “query_work_mem” is exhausted — by the operator itself, or by some other operator in the same query. But doing that runs into starvation/fairness problems, where an unlucky operator, executing later in the query, can’t get any query_work_mem, because earlier, greedy operators used up all of it. The solution I propose here is just to distribute the “query_work_mem” into individual, per-operator, work_mem limits. **Proposal:** I propose that we add a “query_work_mem” GUC, which works by distributing (using some algorithm to be described in a follow-up email) the entire “query_work_mem” to the query’s operators. And then each operator will spill when it exceeds its own work_mem limit. So we’ll preserve the existing “spill” logic as much as possible. To enable this to-be-described algorithm, I would add an “nbytes” field to the Path struct, and display this (and related info) in EXPLAIN PLAN. So the customer will be able to see how much work_mem the SQL compiler thinks they’ll need, per operator; and so will the algorithm. I wouldn’t change the existing planning logic (at least not in the initial implementaton). So the existing planning logic would choose between different SQL operators, still on the assumption that every operator that needs working memory will get work_mem [* hash_mem_multiplier]. This assumption might understate or overstate the actual working memory we’ll give the operator, at runtime. If it understates, the planner will be biased in favor of operators that don’t use much working memory. If it overstates, the planner will be biased in favor of operators that use too much working memory. (We could add a feedback loop to the planner, or even something simple like generating multiple path, at different “work_mem” limits, but everything I can think of here adds complexity without much potential benefit. So I would defer any changes to the planner behavior until later, if ever.) The to-be-described algorithm would look at a query’s Paths’ “nbytes” fields, as well as the session “work_mem” GUC (which would, now, serve as a hint to the SQL compiler), and decide how much of “query_work_mem” to assign to the corresponding Plan node. It would assign that limit to a new “work_mem” field, on the Plan node. And this limit would also be exposed, of course, in EXPLAIN ANALYZE, along with the actual work_mem usage, which might very well exceed the limit. This will let the customer know when a query spills, and why. I would write the algorithm to maintain the existing work_mem behavior, as much as possible. (Backward compatibility is good!) Most likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a *minimum* work_mem. Then, so long as query_work_mem exceeds the sum of work_mem [* hash _mem_multiplier] , for all operators in the query, all operators would be assigned at least work_mem, which would make my proposal a Pareto improvement. Last, at runtime, each PlanState would check its plan -> work_mem field, rather than the global work_mem GUC. Execution would otherwise be the same as today. What do you think? James [1] https://docs.oracle.com/en//database/oracle/oracle-database/23/admin/managing-memory.html#GUID-8D7FC70A-56D8-4CA1-9F74-592F04172EA7 [2] https://www.postgresql.org/message-id/flat/bd57d9a4c219cc1392665fd5fba61dde8027b3da.camel%40crunchydata.com [3] https://www.postgresql.org/docs/current/tablefunc.html