On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote: > >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <ri...@pivotal.io> wrote: > > > >> Hi all, > >> > >> Paul and I have been hacking recently to implement parallel grouping > >> sets, and here we have two implementations. > >> > >> Implementation 1 > >> ================ > >> > >> Attached is the patch and also there is a github branch [1] for this > >> work. > >> > > > >Rebased with the latest master. > > > > Hi Richard, > > thanks for the rebased patch. I think the patch is mostly fine (at least I > don't see any serious issues). A couple minor comments: > Hi Tomas, Thank you for reviewing this patch. > > 1) I think get_number_of_groups() would deserve a short explanation why > it's OK to handle (non-partial) grouping sets and regular GROUP BY in the > same branch. Before these cases were clearly separated, now it seems a bit > mixed up and it may not be immediately obvious why it's OK. > Added a short comment in get_number_of_groups() explaining the behavior when doing partial aggregation for grouping sets. > > 2) There are new regression tests, but they are not added to any schedule > (parallel or serial), and so are not executed as part of "make check". I > suppose this is a mistake. > Yes, thanks. Added the new regression test in parallel_schedule and serial_schedule. > > 3) The regression tests do check plan and results like this: > > EXPLAIN (COSTS OFF, VERBOSE) SELECT ...; > SELECT ... ORDER BY 1, 2, 3; > > which however means that the query might easily use a different plan than > what's verified in the eplain (thanks to the additional ORDER BY clause). > So I think this should explain and execute the same query. > > (In this case the plans seems to be the same, but that may easily change > in the future, and we could miss it here, failing to verify the results.) > Thank you for pointing this out. Fixed it in V4 patch. > > 4) It might be a good idea to check the negative case too, i.e. a query on > data set that we should not parallelize (because the number of partial > groups would be too high). > Yes, agree. Added a negative case. > > > Do you have any plans to hack on the second approach too? AFAICS those two > approaches are complementary (address different data sets / queries), and > it would be nice to have both. One of the things I've been wondering is if > we need to invent gset_id as a new concept, or if we could simply use the > existing GROUPING() function - that uniquely identifies the grouping set. > > Yes, I'm planning to hack on the second approach in short future. I'm also reconsidering the gset_id stuff since it brings a lot of complexity for the second approach. I agree with you that we can try GROUPING() function to see if it can replace gset_id. Thanks Richard
v4-0001-Implementing-parallel-grouping-sets.patch
Description: Binary data