Re: Consider the number of columns in the sort cost model

Alena Rybakina Mon, 28 Oct 2024 02:48:51 -0700

Hi! Thank you for your work on this subject!

On 23.10.2024 04:39, Andrei Lepikhov wrote:

On 15/10/2024 12:15, David Rowley wrote:
On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <[email protected]> wrote:
I think maybe what is worth working on is seeing if you can better
estimate the number of tiebreak comparisons by checking the number of
distinct values for each sort key.  That might require what we
discussed on another thread about having estimate_num_groups() better
use the n_distinct estimate from the EquivalenceMember with the least
distinct values.  It'll be another question if all that can be done
and this all still perform well enough for cost_sort(). You may have
to write it first before we can figure that out.  It may be possible
to buy back some of the overheads with some additional caching...
Perhaps that could be done within the EquivalenceClass.
It seems that your idea with an equivalence class works.
In the patchset attached, I show all the ideas (I initially wanted todiscuss it step-by-step).In the attached, the first patch is about choosing adequate expressionfrom the list of EC members.The second one is the current patch, which considers the number ofsort columns.Third patch - elaboration of the second patch. Use only the trustfulstatistics here - the number of ndistinct values on the first sortedcolumn.And the last patch is a demo of how I'm going to use the previousthree patches and add one more strategy to improve the order ofcolumns in the GROUP-BY clause.

I have some question about your patches.

1. You consider two lists root->group_pathkeys androot->processed_groupClause.

As far as I understand, this is due to the fact that the pathkeyelements are identical to the group clause and identity verification isprecisely checked by this condition:


if (foreach_current_index(lc1) >= root->num_groupby_pathkeys)

To be honest, I didn’t find information about this in the code, but didI understand everything correctly?

2. I noticed that statistics of distinct values are calculated severaltimes. Maybe I miss something, but this can be avoided if thesestatistics can be saved after calculation. For example, I saw that it ispossible that you calculate the same statistic information for the sameequivalence members in cost_incremental_sort and identify_sort_ecmember.Is it possible to store information in a structure and use it later?

3. I think you should initialize the variable ndist in your patch. Ifaced the compile complaining during your code compilation.


costsize.c: In function ‘identify_sort_ecmember’:

costsize.c:6694:42: error: ‘ndist’ may be used uninitialized[-Werror=maybe-uninitialized]

 6694 |                         em->em_ndistinct = ndist;
      |                         ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
 6680 |                         double  ndist;
      |                                 ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1

4. Before executing examine_variable function I think you should checkthat it is not null.

@@ -575,6 +575,8 @@ get_useful_group_keys_orderings(PlannerInfo *root,Path *path)

              */
             continue;

+        Assert (node != NULL);
+
         examine_variable(root, node, 0, &vardata);
         if (!HeapTupleIsValid(vardata.statsTuple))
             continue;


--
Regards,
Alena Rybakina
Postgres Professional

Re: Consider the number of columns in the sort cost model

Reply via email to