On 03/20/2016 09:58 AM, David Rowley wrote:
On 20 March 2016 at 03:19, Robert Haas <robertmh...@gmail.com> wrote:
On Fri, Mar 18, 2016 at 10:32 PM, David Rowley
Updated patch is attached.
I think this looks structurally correct now, and I think it's doing
the right thing as far as parallelism is concerned. I don't see any
obvious problems in the rest of it, either, but I haven't thought
hugely deeply about the way you are doing the costing, nor have I
totally convinced myself that all of the PathTarget and setrefs stuff
is correct. But I think it's probably pretty close. I'll study it
some more next week.
Thank you for the reviews. The only thing I haven't already mentioned is
that I designed estimate_hashagg_tablesize() to be reusable from various
places in planner.c, yet I've only made use of it in
create_grouping_paths(). It might be nice to also modify the other
places which perform a similar calculation to use that function, but I
don't think that needs to be done for this patch... perhaps as a
follow-on cleanup.
Hmmm, so how many places are there that could use the new function? I've
only found these two:
That doesn't seem like an extremely high number, so perhaps doing it in
this patch would be fine. However, if the function is defined as static,
choose_hashed_setop won't be able to use it anyway (well, we'd have to
move the prototype into a header and remove the static).
I wonder why we would need to allow cost_agg=NULL, though? I mean, if
there is no costing information, wouldn't it be better to either raise
an error, or at the very least do something like:
hashentrysize += hash_agg_entry_size(0);
which is what create_distinct_paths does?
I'm not sure changing the meaning of enable_hashagg like this is a good
idea. It worked as a hard switch before, while with this change that
would not be the case. Or more accurately - it would not be the case for
aggregates, but it would still work the old way for other types of
plans. Not sure that's a particularly good idea.
What about introducing a GUC to enable parallel aggregate, while still
allowing other types of parallel nodes? I guess similar GUCs would be
handy for other parallel node types too - it's a bit of a blunt tool,
but tweaking max_parallel_degree is even blunter. E.g.
enable_parallelagg?
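For illustration, such a GUC would presumably be a boolean entry in guc.c's ConfigureNamesBool table. This is only a sketch of what that fragment might look like - the name enable_parallelagg, its category, and its default are all assumptions, not committed code:

```c
/* Hypothetical guc.c ConfigureNamesBool entry - a sketch, not real code */
{
    {"enable_parallelagg", PGC_USERSET, QUERY_TUNING_METHOD,
        gettext_noop("Enables the planner's use of parallel aggregation plans."),
        NULL
    },
    &enable_parallelagg,
    true,
    NULL, NULL, NULL
},
```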
I do also have a question about parallel aggregate vs. work_mem.
Nowadays we mostly tell users a query may allocate a multiple of
work_mem, up to one per node in the plan. Apparently with parallel
aggregate we'll have to say "multiplied by the number of workers",
because each aggregate worker may allocate up to hashaggtablesize. Is
that reasonable? Shouldn't we restrict the total size of the hash tables
across all workers instead?
create_grouping_paths also contains this comment:
* Generate HashAgg Path providing the estimated hash table size is not
* too big, although if no other Paths were generated above, then we'll
* begrudgingly generate one so that we actually have a Path to work with.
I'm not sure this is a particularly clear comment; I think the old one
was much more informative despite being a single line:
/* Consider reasons to disable hashing, but only if we can sort instead */
BTW create_grouping_paths probably grew to a size when splitting it into
smaller pieces would be helpful?
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services