On Sun, Nov 26, 2017 at 3:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>>> That having been said, I think the place where our plans most commonly
>>> go wrong is where we incorrectly estimate the number of tuples by
>>> multiple orders of magnitude - 100x is common, 1000x is common, a
>>> million x is not uncommon, even a billion x is not unheard-of.  And I
>>> don't think there's any way to make a hash join happy if it thinks
>>> it's going to need 1 batch and it ends up needing a million batches.
>> What about dynamic role reversal? That could make a big difference.
> In the best case it's great, but it looks to me like there are a lot
> of thorny problems.

There are loads of inter-related topics discussed in this thread,
including some operator-specific stuff like the above, and some more
general stuff, all requiring more research.  In the meantime, I wonder
if there are some simpler incremental improvements we could consider.

Since work_mem currently acts as a kind of per executor node instance
limit, the system-wide peak memory usage could be described as number
of concurrent queries * number of executor nodes * number of parallel
participants * work_mem.  In the past I think the number of executor
nodes was practically anchored to the ground by the number of
relations in the query (not necessarily linearly, but not far off it),
and the number of parallel participants was one.  With the advent of
parallel query we have this new multiplying term, and with the advent
of partitions and partition-wise join we have exciting new ways to
explode the number of executor nodes when the user only explicitly
named a few relations.

We could imagine various levels of memory budgeting:

1.  work_mem_per_system (global budget).
2.  work_mem_per_query (work_mem somehow shared out between executor nodes).
3.  Per planned executor node budget (workers get a fraction of
work_mem for each node).
4.  What we have today: per executor node instance budget (workers get
to use work_mem for each node).

1 and 2 seem like they might be boil-the-ocean problems.  But as far
as I know moving from 4 to 3 would merely require warming up a minor
lake.  That would take out one of the multipliers, and would remove a
perverse incentive from any potential cost-based parallel degree
choosing algorithms (you can print free memory by adding new workers.)

Parallel Hash either combines the memory budgets of all participants
to make one large no-partition hash table, or partitions the inner
relation into work_mem sized batches and loads several of them into
memory at the same time (at most one per participant).  Either way the
total memory usage is participants * work_mem, consistent with policy
4 and consistent with the total budget given to equivalent
parallel-oblivious hash join, sort-merge join or any other node.

If we switched to policy 3 and (say) work_mem were somehow
automagically adjusted to be divided by number of participants at
planning and execution time, then Parallel Hash wouldn't have to
change at all to conform to the policy.  It would use at most work_mem
per Parallel Hash node, no matter how many workers and no matter which
of its strategies it picked (either it receives a budget of work_mem /
participants, and then multiplies it by participants to create a
no-partition hash table combining the participants' budgets, or it
lets each participant chew on smaller hash tables adding up to at most
work_mem).  Just the same total per-node budget as any other executor
node gets.

Thomas Munro

Reply via email to