On 06/05/2018 07:56 PM, Teodor Sigaev wrote:
Thanks for the patch. This (missing) optimization popped-up repeatedly
recently, and I was planning to look into it for PG12. So now I don't
have to, because you've done all the hard work ;-)
You are welcome. Actually, one of our customers ran into the problem
with GROUP BY column order, specifically with reordering in the absence
of any indexes, what you describe as problem 2). I noticed the index
optimization later. Following the order you suggested, I split the patch
into an index part and a non-index part, with the second part depending
on the first. They touch the same part of the code, so they cannot be
fully independent.
The way I see it the patch does two different things:
a) consider additional indexes by relaxing the pathkeys check
b) if there are no indexes, i.e. when an explicit Sort is needed,
consider reordering the columns by ndistinct
Not sure why those two parts couldn't be separated. I haven't tried
splitting the patch, of course, so I may be missing something.
In the worst case, one part will depend on the other, which is OK. It
still allows us to commit the first part and continue working on the
other one, for example.
1) add_path() ensures that we only keep the single cheapest path for
each set of pathkeys. This patch somewhat defeats that, because it
considers additional pathkeys (essentially any permutation of the group
keys) as interesting. So we may end up with more paths in the list.
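For readers unfamiliar with add_path(), its dominance rule can be sketched as a simplified Python model (illustrative only, not the actual PostgreSQL implementation; the dict-based structure and field names are my own):

```python
# Simplified model of add_path(): keep only the cheapest path per pathkeys.
def add_path(paths, new_path):
    """paths maps a pathkeys tuple to the cheapest known path for that ordering."""
    key = new_path["pathkeys"]
    cur = paths.get(key)
    if cur is None or new_path["cost"] < cur["cost"]:
        paths[key] = new_path

paths = {}
add_path(paths, {"pathkeys": ("a", "b"), "cost": 100.0, "desc": "index scan (a,b)"})
add_path(paths, {"pathkeys": ("a", "b"), "cost": 150.0, "desc": "explicit sort"})
# Considering permutations of the group keys adds extra entries per ordering:
add_path(paths, {"pathkeys": ("b", "a"), "cost": 120.0, "desc": "index scan (b,a)"})
print(len(paths))  # 2: one surviving path per distinct pathkeys
```

This is exactly why relaxing the pathkeys check can inflate the path list: each additional permutation of the group keys creates a new key under which a path can survive.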
It seems an index scan could give benefits here only if:
1) it's an index-only scan, or
2) it's a full index scan (as opposed to index-only) where the physical
order of the heap is close to the logical index order (i.e. the table is
clustered).
In other cases the cost of disk seeks will be very high. But at this
stage of planning we don't know those facts yet, so we can't make a good
decision here and should trust the add_path() logic.
Not sure what you mean? Surely we do costing of the paths at this stage,
so we can decide which one is cheaper etc. The decision which paths to
keep is done by add_path(), and it should stay like this, of course. I
wasn't suggesting to move the logic elsewhere.
> I wonder if we should make the add_path() logic smarter to recognize
> when two paths have different pathkeys but were generated to match the
> same grouping, to reduce the number of paths added by this optimization.
> Currently we do that for each pathkeys list independently, but we're
> considering many more pathkeys orderings ...
Hm, I tend to say no.
select .. from t1 group by a, b
union
select .. from t2 group by a, b
t1 and t2 could have different sets of indexes and different data
distributions, so locally it could be cheaper to use one index (for
example, one defined as (b, a) and a second as (a, b, c, d), the second
being larger), but globally another (for example, if the second table
doesn't have a (b, a) index).
But add_path() treats each of the relations independently, so why
couldn't we pick a different index for each of the two relations?
2) sort reordering based on ndistinct estimates
But thinking about this optimization, I'm worried it relies on a couple
of important assumptions. Until now those decisions could be made by the
person writing the SQL query, but this optimization makes that
impossible. So we really need to get this right.
Hm, SQL by design should not be used that way, but, of course, it is :(
Well, yes and no. I'm not worried about people relying on us to give
them some ordering - they can (and should) add an ORDER BY clause to fix
that. I'm more worried about the other stuff.
For example, it seems to disregard that different data types have
different comparison costs. Comparing bytea values will be far more
expensive than comparing int4 values, so it may be much more efficient
to compare int4 columns first, even if they have far fewer distinct
values.
As far as I can see, cost_sort() doesn't pay attention to these details,
and improving that should be a separate patch. But the sort itself also
does not reorder columns.
Imagine you have a custom data type that is expensive for comparisons.
You know that, so you place it at the end of GROUP BY clauses, to reduce
the number of comparisons on that field. And then we come along and just
reorder the columns, placing it first, because it happens to have a high
ndistinct statistic. And there's no way to get the original behavior :-(
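A rough cost model makes the concern concrete. The sketch below is illustrative Python with made-up cost numbers (it is not cost_sort()): a multi-column sort comparison only falls through to the next column when the earlier columns compare equal, so a cheap-to-compare column can be worth placing first even with fewer distinct values.

```python
import math

# Illustrative model: sorting performs about n*log2(n) comparisons; the
# second column is examined only when the first compares equal, which under
# a uniformity assumption happens with probability ~ 1/ndistinct.
def sort_cost(n, columns):
    """columns: list of (ndistinct, per_comparison_cost), in sort order."""
    comparisons = n * math.log2(n)
    total, fallthrough = 0.0, 1.0
    for ndistinct, cmp_cost in columns:
        total += comparisons * fallthrough * cmp_cost
        fallthrough *= 1.0 / ndistinct  # chance the comparison must continue
    return total

n = 1_000_000
expensive_type = (100_000, 50.0)  # many distinct values, costly comparator (e.g. bytea)
cheap_int4 = (100, 1.0)           # few distinct values, cheap comparator

cost_expensive_first = sort_cost(n, [expensive_type, cheap_int4])
cost_int4_first = sort_cost(n, [cheap_int4, expensive_type])
print(cost_int4_first < cost_expensive_first)  # True: int4 first wins here
```

Under these (assumed) numbers, putting the expensive high-ndistinct column first costs roughly 30x more, which is the behavior a query author would have avoided by hand.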
Also, simply sorting the columns by their ndistinct estimate is
somewhat naive, because it assumes the columns are independent.
Imagine for example a table with three columns, a, b and c, where the
columns are not independent (say, b is functionally dependent on a).
Then, when evaluating GROUP BY a,b,c, it would clearly be more efficient
to use "(a,c,b)" than "(a,b,c)", but there is no way to decide this
merely from per-column ndistinct values. Luckily, we now have
multi-column ndistinct coefficients which could be used to decide this,
I believe (but I haven't tried).
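A hypothetical dataset of that shape (my own illustration, not taken from the thread) shows why per-column ndistinct is not enough. Here b is fully determined by a, while c is not:

```python
# Hypothetical rows: a has 1000 distinct values, b = a // 10 (100 distinct,
# functionally dependent on a), c has 100 distinct values independent of a.
rows = [(i % 1000, (i % 1000) // 10, (i // 100) % 100) for i in range(10_000)]

def ndistinct(cols):
    """Number of distinct value combinations over the given column indexes."""
    return len({tuple(r[i] for i in cols) for r in rows})

# Per-column ndistinct says b and c are interchangeable (100 each):
print(ndistinct([0]), ndistinct([1]), ndistinct([2]))  # 1000 100 100
# But the prefixes tell a different story: after a, b adds nothing and c does:
print(ndistinct([0, 1]))  # 1000  (b is dependent on a, no new groups)
print(ndistinct([0, 2]))  # 10000 (c actually discriminates)
```

So sorting by (a,c,b) resolves almost all ties at the second column, while (a,b,c) keeps falling through to the third, even though the single-column statistics look identical. Multi-column ndistinct estimates capture exactly this difference.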
Agree, but I don't know how to use it here. Except, maybe:
1) first column: the column with the largest estimated number of groups
2) second column: the column for which the pair (first, second) has the
largest estimated number of groups
3) third column: the column for which the triple (first, second, third)
has the largest ...
But it seems that even that algorithm, ISTM, could be implemented in a
cheaper manner.
Maybe. I do have some ideas, although I haven't tried implementing it.
If there's no extended statistic on the columns, you can do the current
thing (assuming independence etc.). There's not much we can do here.
If there's an extended statistic, you can do either a greedy search (get
the next column with the highest ndistinct coefficient) or exhaustive
search (computing the estimated number of comparisons).
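The greedy variant both messages describe can be sketched as follows (an illustration under assumptions: the estimates dict stands in for extended-statistics-based group-count estimates, which the planner would derive from ndistinct coefficients rather than a lookup table):

```python
# Greedy search: repeatedly pick the column whose addition to the chosen
# prefix yields the largest estimated number of groups.
def greedy_order(columns, ndistinct_est):
    """ndistinct_est maps a frozenset of columns to an estimated group count."""
    order, remaining = [], set(columns)
    while remaining:
        best = max(remaining,
                   key=lambda c: ndistinct_est[frozenset(order + [c])])
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical estimates for columns a, b, c (with b dependent on a):
est = {
    frozenset({"a"}): 1000, frozenset({"b"}): 100, frozenset({"c"}): 100,
    frozenset({"a", "b"}): 1000, frozenset({"a", "c"}): 10000,
    frozenset({"b", "c"}): 5000, frozenset({"a", "b", "c"}): 10000,
}
print(greedy_order(["a", "b", "c"], est))  # ['a', 'c', 'b']
```

The exhaustive alternative would score every permutation by estimated comparison count, which is O(n!) in the number of group keys, so the greedy form is the more plausible default for longer GROUP BY lists.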
Another challenge is that using only the ndistinct coefficient assumes
uniform distribution of the values. But you can have a column with 1M
distinct values, where a single value represents 99% of the rows. And
another column with 100k distinct values, with actual uniform
distribution. I'm pretty sure it'd be more efficient to place the 100k
column first.
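That intuition can be quantified: what drives the number of fall-through comparisons is not ndistinct but the probability that two random rows tie on the column, i.e. the sum of squared value frequencies. A sketch with the (illustrative) numbers from the paragraph above:

```python
# Probability that two randomly chosen rows compare equal on a column:
# sum of squared relative frequencies of its values.
def tie_probability(freqs):
    total = sum(freqs)
    return sum((f / total) ** 2 for f in freqs)

# 1M distinct values, but one value covers 99% of 100M rows:
skewed = [99_000_000] + [1] * 999_999
# 100k distinct values, uniformly distributed over the same row count:
uniform = [1000] * 100_000

print(tie_probability(skewed))   # roughly 0.98
print(tie_probability(uniform))  # roughly 1e-05
```

Despite having 10x more distinct values, the skewed column ties on ~98% of comparisons, so placing the uniform 100k-distinct column first avoids almost all secondary comparisons.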
The real issue, however, is that this decision can't be made entirely
locally. Consider for example this:
explain select a,b,c, count(*) from t group by a,b,c order by c,b,a;
Here a single sort on (c,b,a) can satisfy both the grouping and the
ORDER BY, which is clearly cheaper (at least according to the optimizer)
than doing two separate sorts. So the optimization actually does the
wrong thing here, and it needs to somehow consider the other ordering
requirements (in this case the ORDER BY), either by generating multiple
paths with different orderings or by some heuristics.
Hm, thank you. I consider that a bug in my implementation: the basic
idea was to try to match an already existing or required order, and only
if we fail, or are left with an unmatched tail of the pathkey list, to
look for the cheapest column order.
Fixed in v7 (0002-opt_group_by_index_and_order-v7.patch), though perhaps
in a naive way: if we don't have a suitable path pathkey, we first try
to reorder the columns according to the ORDER BY clause. A test for your
case is also added.
OK. I'll give it a try.
I'm also wondering how freely we can change the group by result
ordering. Assume you disable hash aggregate and parallel query -
currently we are guaranteed to use group aggregate that produces
exactly the ordering as specified in GROUP BY, but this patch removes
that "guarantee" and we can produce arbitrary permutation of the
ordering. But as I argued in other threads, such implicit guarantees
are really fragile, can silently break for arbitrary reasons (say,
parallel query will do just that) and can be easily fixed by adding a
proper ORDER BY. So I don't think this is an issue.
Agree. SQL by design gives no guarantee of any particular order without
an explicit ORDER BY clause.
The other random thought is how this will/could interact with the
incremental sort patch. I know Alexander wanted to significantly limit
where we actually consider the incremental sort, but I'm not sure if
this was one of those places or not (it sure looks like a place where we
would greatly benefit from it).
It seems they should not directly interact: this patch tries to find the
cheapest column order, while Alexander's patch tries to make the sort
itself cheaper; they are different tasks. But I will try.
Thanks.
So to summarize this initial review: I do suggest splitting the patch
into two parts, one that does the index magic, and one for this
reordering optimization. The first part (additional indexes) seems
fairly safe and likely to become committable soon. The other part
(ndistinct reordering) IMHO requires more thought regarding costing and
interaction with other parts of the query.
Thank you for review!
;-)
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services