Hi! Thank you for your work on this subject!

On 23.10.2024 04:39, Andrei Lepikhov wrote:
On 15/10/2024 12:15, David Rowley wrote:
On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <lepi...@gmail.com> wrote:
I think maybe what is worth working on is seeing if you can better
estimate the number of tiebreak comparisons by checking the number of
distinct values for each sort key.  That might require what we
discussed on another thread about having estimate_num_groups() better
use the n_distinct estimate from the EquivalenceMember with the least
distinct values.  It'll be another question if all that can be done
and this all still perform well enough for cost_sort(). You may have
to write it first before we can figure that out.  It may be possible
to buy back some of the overheads with some additional caching...
Perhaps that could be done within the EquivalenceClass.
It seems that your idea with an equivalence class works.
In the patchset attached, I show all the ideas (I initially wanted to discuss it step-by-step). In the attached, the first patch is about choosing adequate expression from the list of EC members. The second one is the current patch, which considers the number of sort columns. Third patch - elaboration of the second patch. Use only the trustful statistics here - the number of ndistinct values on the first sorted column. And the last patch is a demo of how I'm going to use the previous three patches and add one more strategy to improve the order of columns in the GROUP-BY clause.

I have some question about your patches.

1. You consider two lists root->group_pathkeys and root->processed_groupClause.

As far as I understand, this is due to the fact that the pathkey elements are identical to the group clause and identity verification is precisely checked by this condition:

if (foreach_current_index(lc1) >= root->num_groupby_pathkeys)

To be honest, I didn’t find information about this in the code, but did I understand everything correctly?

2. I noticed that statistics of distinct values ​​are calculated several times. Maybe I miss something, but this can be avoided if these statistics can be saved after calculation. For example, I saw that it is possible that you calculate the same statistic information for the same equivalence members in cost_incremental_sort and identify_sort_ecmember. Is it possible to store information in a structure and use it later?

3. I think you should initialize the variable ndist in your patch. I faced the compile complaining during your code compilation.

costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized]
 6694 |                         em->em_ndistinct = ndist;
      |                         ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
 6680 |                         double  ndist;
      |                                 ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1

4. Before executing examine_variable function I think you should check that it is not null.

@@ -575,6 +575,8 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
              */
             continue;

+        Assert (node != NULL);
+
         examine_variable(root, node, 0, &vardata);
         if (!HeapTupleIsValid(vardata.statsTuple))
             continue;


--
Regards,
Alena Rybakina
Postgres Professional



Reply via email to