On Sun, Nov 26, 2017 at 3:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>>> That having been said, I think the place where our plans most commonly
>>> go wrong is where we incorrectly estimate the number of tuples by
>>> multiple orders of magnitude - 100x is common, 1000x is common, a
>>> million x is not uncommon, even a billion x is not unheard-of. And I
>>> don't think there's any way to make a hash join happy if it thinks
>>> it's going to need 1 batch and it ends up needing a million batches.
>>
>> What about dynamic role reversal? That could make a big difference.
>
> In the best case it's great, but it looks to me like there are a lot
> of thorny problems.
There are loads of inter-related topics discussed in this thread, including some operator-specific stuff like the above, and some more general stuff, all requiring more research. In the meantime, I wonder if there are some simpler incremental improvements we could consider.

Since work_mem currently acts as a kind of per executor node instance limit, the system-wide peak memory usage could be described as number of concurrent queries * number of executor nodes * number of parallel participants * work_mem. In the past I think the number of executor nodes was practically anchored to the ground by the number of relations in the query (not necessarily linearly, but not far off it), and the number of parallel participants was one. With the advent of parallel query we have this new multiplying term, and with the advent of partitions and partition-wise join we have exciting new ways to explode the number of executor nodes when the user only explicitly named a few relations.

We could imagine various levels of memory budgeting:

1. work_mem_per_system (global budget).
2. work_mem_per_query (work_mem somehow shared out between executor nodes).
3. Per planned executor node budget (workers get a fraction of work_mem for each node).
4. What we have today: per executor node instance budget (workers get to use work_mem for each node).

1 and 2 seem like they might be boil-the-ocean problems. But as far as I know, moving from 4 to 3 would merely require warming up a minor lake. That would take out one of the multipliers, and would remove a perverse incentive from any potential cost-based parallel degree choosing algorithms (you can print free memory by adding new workers).

Parallel Hash either combines the memory budgets of all participants to make one large no-partition hash table, or partitions the inner relation into work_mem sized batches and loads several of them into memory at the same time (at most one per participant).
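To make the multipliers concrete, here is a small back-of-the-envelope sketch (illustrative only; the function names and the example workload numbers are invented, and real memory accounting in PostgreSQL is more nuanced than a straight product):

```python
# Hypothetical sketch comparing policy 4 (per executor node instance
# budget) with policy 3 (per planned executor node budget, split among
# parallel participants). All figures in kB.

WORK_MEM_KB = 4 * 1024  # work_mem = 4MB

def peak_memory_policy4(queries, nodes_per_query, participants):
    """Policy 4: every participant of every node instance may use work_mem."""
    return queries * nodes_per_query * participants * WORK_MEM_KB

def peak_memory_policy3(queries, nodes_per_query, participants):
    """Policy 3: each planned node gets work_mem, shared by its participants."""
    per_participant = WORK_MEM_KB // participants
    return queries * nodes_per_query * participants * per_participant

# Invented example workload: 10 concurrent queries, 20 executor nodes
# each, 4 parallel participants.
print(peak_memory_policy4(10, 20, 4))  # 3276800 kB (~3.1 GB)
print(peak_memory_policy3(10, 20, 4))  # 819200 kB (~800 MB) -- the
# parallel-participants multiplier drops out of the worst case.

# Parallel Hash under policy 3: each participant receives a budget of
# work_mem / participants; combining those budgets for a no-partition
# hash table gives exactly work_mem per node, however many workers run.
def parallel_hash_budget(participants):
    per_participant = WORK_MEM_KB // participants
    return per_participant * participants

assert parallel_hash_budget(1) == parallel_hash_budget(4) == WORK_MEM_KB
```

The point of the sketch is just that under policy 3 the worker count stops multiplying the system-wide worst case, while Parallel Hash's combined-budget behaviour is unchanged.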
Either way the total memory usage is participants * work_mem, consistent with policy 4 and consistent with the total budget given to an equivalent parallel-oblivious hash join, sort-merge join or any other node.

If we switched to policy 3 and (say) work_mem were somehow automagically divided by the number of participants at planning and execution time, then Parallel Hash wouldn't have to change at all to conform to the policy. It would use at most work_mem per Parallel Hash node, no matter how many workers and no matter which of its strategies it picked (either it receives a budget of work_mem / participants and then multiplies it by participants to create a no-partition hash table combining the participants' budgets, or it lets each participant chew on smaller hash tables adding up to at most work_mem). Just the same total per-node budget as any other executor node gets.

--
Thomas Munro
http://www.enterprisedb.com