On Wed, Mar 10, 2021 at 4:05 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Hi,
>
> I took a look at the patch today. Attached is the v12 and a separate
> patch with some comment tweaks and review comments.
>
>
> 1) I see the patch results in some plan changes in postgres_fdw. I
> assume it's somehow related to the sort costing changes, but I find it a
> bit suspicious. Why should the plan change from this:
>
>   Foreign Scan
>     Remote SQL: ... ORDER BY ... OFFSET 100 LIMIT 10;
>
> to
>
>   Limit
>     Foreign Scan
>       Remote SQL: ... ORDER BY ...
>
> But even if this is due to changes to ORDER BY costing, postponing the
> limit seems like a net loss - the sort happens before the limit, and by
> postponing the limit after the foreign scan we need to transfer 100 more
> rows. So what's causing this?
>
> The difference in plan cost seems fairly substantial (old: 196.52, new:
> 241.94). But maybe that's OK and expected.
>
>
> 2) I wonder how much more expensive (in terms of CPU) the new sort
> costing code is. It's iterative, and it's calling functions to calculate
> the number of groups etc. I wonder what the impact on simple queries is.
>
>
> 3) It's not clear to me why we need the "fake var" protection. Why would
> estimate_num_groups() fail on that?
>
>
> 4) In one of the places a comment says DEFAULT_NUM_DISTINCT is not used
> because we're dealing with multiple columns. That makes sense, but maybe
> we should use DEFAULT_NUM_DISTINCT at least to limit the range. That is,
> with N columns we should restrict the nGroups estimate by:
>
>   min = DEFAULT_NUM_DISTINCT
>   max = Min(pow(DEFAULT_NUM_DISTINCT, ncolumns), tuples)
>
> Also, it's not clear what the logic behind this formula is:
>
>   nGroups = ceil(2.0 + sqrt(tuples) *
>       list_length(pathkeyExprs) / list_length(pathkeys));
>
> A comment explaining that would be helpful.
>
>
> 5) The compute_cpu_sort_cost comment includes two references (in a quite
> mixed-up way), and then there's another reference in the code. I suggest
> we list all of them in the function comment.
> 6) There's a bunch of magic values (various multipliers etc.). It's not
> clear which values are meant to be equal, etc. Let's introduce nicer
> constants with reasonable names.
>
>
> 7) The comment at get_cheapest_group_keys_order is a bit misleading,
> because it talks about "uniqueness" - but that's not what we're looking
> at here (I mean, whether the column is unique or not). We're looking at
> the number of distinct values in the column, which is a different thing.
> Also, it'd be good to roughly explain what get_cheapest_group_keys_order
> does - how it calculates the order (permutations of up to 4 values, etc.).
>
>
> 8) The comment at compute_cpu_sort_cost should also explain the basics
> of the algorithm. I tried doing something like that, but I'm not sure I
> got all the details right and it probably needs further improvements.
>
>
> 9) The main concern I have is still about the changes in planner.c, and
> I think I've already shared it before. The code takes the grouping
> pathkeys and just reshuffles them to what it believes is a better order
> (cheaper, ...). That is, there's one input pathkey list, and it gets
> transformed into another pathkey list. The problem is that even if this
> optimizes the sort, it may work against some optimization in the upper
> part of the plan.
>
> Imagine a query that does something like this:
>
>   SELECT a, b, count(*) FROM (
>     SELECT DISTINCT a, b, c, d FROM x
>   ) GROUP BY a, b;
>
> Now, from the costing perspective, it may be cheaper to do the inner
> grouping by "c, d, a, b" for example. But that works directly against
> the second grouping, which would benefit from having the input sorted by
> "a, b". How exactly this behaves depends on the number of distinct
> values in the columns, how expensive the comparisons are for each
> column, and so on. But you get the idea.
>
> I admit I haven't managed to construct a reasonable query that'd be
> significantly harmed by this, but I haven't spent a lot of time on it.
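To make the "permutations of up to 4 values" idea in point (7) concrete, here is a toy C sketch. Everything here (function names and the cost model) is made up for illustration, not taken from the patch: it assumes a later sort key is only compared when all earlier keys tie, roughly with probability 1/ndistinct per earlier key, so keys with more distinct values are cheapest to sort first. It enumerates all permutations (at most 4! = 24) and returns the cheapest order:

```c
#include <math.h>

#define MAX_KEYS 4              /* cap the search, as the patch does */

/*
 * Toy cost model (illustrative only): key order[i] is compared only
 * when all earlier keys in the order tie, which we approximate as
 * probability 1/ndistinct for each earlier key.
 */
static double
order_cost(const double *ndistinct, const int *order, int nkeys)
{
    double  cost = 0.0;
    double  tie_prob = 1.0;

    for (int i = 0; i < nkeys; i++)
    {
        cost += tie_prob;
        tie_prob /= ndistinct[order[i]];
    }
    return cost;
}

/* Recursively enumerate permutations by swapping; keep the cheapest. */
static double
permute(const double *ndistinct, int *order, int *best,
        int pos, int nkeys, double best_cost)
{
    if (pos == nkeys)
    {
        double  cost = order_cost(ndistinct, order, nkeys);

        if (cost < best_cost)
        {
            best_cost = cost;
            for (int i = 0; i < nkeys; i++)
                best[i] = order[i];
        }
        return best_cost;
    }
    for (int i = pos; i < nkeys; i++)
    {
        int     tmp;

        tmp = order[pos]; order[pos] = order[i]; order[i] = tmp;
        best_cost = permute(ndistinct, order, best, pos + 1, nkeys, best_cost);
        tmp = order[pos]; order[pos] = order[i]; order[i] = tmp;
    }
    return best_cost;
}

/* Find the cheapest ordering of nkeys (<= MAX_KEYS) grouping keys. */
static double
cheapest_order(const double *ndistinct, int nkeys, int *best)
{
    int     order[MAX_KEYS];

    for (int i = 0; i < nkeys; i++)
        order[i] = i;
    return permute(ndistinct, order, best, 0, nkeys, INFINITY);
}
```

Under this toy model a key with 1000 distinct values always sorts ahead of one with 2, which is exactly the kind of reordering that point (9) warns can conflict with the ordering an upper plan node wants.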
> I'm pretty sure I used this trick in the past, when doing aggregations
> on large data sets, where it was much better to "pipe" the data through
> multiple GroupAggregate nodes.
>
> For this reason I think the code generating paths should work a bit like
> get_useful_pathkeys_for_relation and generate_useful_gather_paths:
>
> * group_keys_reorder_by_pathkeys should not produce just one list of
> pathkeys, but multiple interesting options. At the moment it should
> produce at least the input grouping pathkeys (as specified in the
> query) and the "optimal" pathkeys (by lowest cost).
>
> * the places calling group_keys_reorder_by_pathkeys should loop over the
> result and generate a separate path for each option.
>
> I'd guess in the future we'll "peek forward" in the plan and see if
> there are other interesting pathkeys (that's the expectation for
> get_useful_pathkeys_for_relation).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

The patch does not apply successfully; can you please take a look at that?

http://cfbot.cputube.org/patch_33_1651.log

=== applying patch ./0001-v12-20210309.patch
patching file src/include/utils/selfuncs.h
Hunk #1 FAILED at 198.
1 out of 1 hunk FAILED -- saving rejects to file src/include/utils/selfuncs.h.rej

Based on @Tomas Vondra's comments, and since the patch does not apply, I am changing the status to "Waiting on Author".

--
Ibrar Ahmed