On 03/20/2016 09:58 AM, David Rowley wrote:
On 20 March 2016 at 03:19, Robert Haas <robertmh...@gmail.com> wrote:
On Fri, Mar 18, 2016 at 10:32 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
Updated patch is attached.

I think this looks structurally correct now, and I think it's doing
the right thing as far as parallelism is concerned.  I don't see any
obvious problems in the rest of it, either, but I haven't thought
hugely deeply about the way you are doing the costing, nor have I
totally convinced myself that all of the PathTarget and setrefs stuff
is correct.  But I think it's probably pretty close.  I'll study it
some more next week.

Thank you for the reviews. The only thing I can think to mention which
I've not already is that I designed estimate_hashagg_tablesize() to be
reusable in various places in planner.c, yet I've only made use of it
in create_grouping_paths(). I would imagine that it might be nice to
also modify the other places which perform a similar calculation to
use that function, but I don't think that needs to be done for this
patch... perhaps a follow-on cleanup.

Hmmm, so how many places that could use the new function are there? I've only found these two:

   create_distinct_paths (planner.c)
   choose_hashed_setop (prepunion.c)

That doesn't seem like an extremely high number, so perhaps doing it in this patch would be fine. However if the function is defined as static, choose_hashed_setop won't be able to use it anyway (well, it'll have to move the prototype into a header and remove the static).

I wonder why we would need to allow cost_agg=NULL, though? I mean, if there are no costing information, wouldn't it be better to either raise an error, or at the very least do something like:

    } else
        hashentrysize += hash_agg_entry_size(0);

which is what create_distinct_paths does?

I'm not sure changing the meaning of enable_hashagg like this is a good idea. It worked as a hard switch before, while with this change that would not be the case. Or more accurately - it would not be the case for aggregates, but it would still work the old way for other types of plans. Not sure that's a particularly good idea.

What about introducing a GUC to enable parallel aggregate, while still allowing other types of parallel nodes (I guess that would be handy for other types of parallel nodes - it's a bit blunt tool, but tweaking max_parallel_degree is even blumter)? e.g. enable_parallelagg?

I do also have a question about parallel aggregate vs. work_mem. Nowadays we mostly say to users a query may allocate a multiple of work_mem, up to one per node in the plan. Apparently with parallel aggregate we'll have to say "multiplied by number of workers", because each aggregate worker may allocate up to hashaggtablesize. Is that reasonable? Shouldn't we restrict the total size of hash tables in all workers somehow?

create_grouping_paths also contains this comment:

  * Generate HashAgg Path providing the estimated hash table size is not
  * too big, although if no other Paths were generated above, then we'll
  * begrudgingly generate one so that we actually have a Path to work
  * with.

I'm not sure this is a particularly clear comment, I think the old one was way much informative despite being a single line:

/* Consider reasons to disable hashing, but only if we can sort instead */

BTW create_grouping_paths probably grew to a size when splitting it into smaller pieces would be helpful?


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to