On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepi...@gmail.com> wrote:
> Thanks for taking a look!
>
> On 11/25/24 23:45, Heikki Linnakangas wrote:
> > On 24/09/2024 08:08, Andrei Lepikhov wrote:
> >> +     * proves the var is unique for this query.  However, we'd better still
> >> +     * believe the null-fraction statistic.
> >>       */
> >>      if (vardata->isunique)
> >>          stadistinct = -1.0 * (1.0 - stanullfrac);
> >
> > I wonder about the "we'd better still believe the null-fraction
> > statistic" part. It makes sense for a unique index, but a DISTINCT or
> > GROUP BY collapses all the NULLs to a single row. So I think there's
> > some more work to be done here.
> IMO, in that particular case, it is not an issue: for GROUP BY we set
> the vardata->isunique field and disallow recursing into the Var
> statistics inside the subquery - just as DISTINCT already does. So we
> end up with stanullfrac == 0, which means the optimiser doesn't count
> the number of NULLs. In the case of a UNIQUE index, the optimiser does
> have the stanullfrac statistic and counts the NULLs.

This sounds convincing.
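
To spell the arithmetic out: a negative stadistinct is interpreted as a
fraction of the row count.  With a unique index over 1000 rows and
stanullfrac = 0.1, we get stadistinct = -1.0 * (1.0 - 0.1) = -0.9,
i.e. about 900 distinct non-NULL values.  On the GROUP BY / DISTINCT
path stanullfrac is 0, so stadistinct stays at -1.0 and every row is
taken as distinct - which is correct there, since the NULLs below have
already been collapsed into a single group.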

> But your question raises another one. May we add some vardata_extra to
> a node, which could track specific conditions, and let upper nodes
> consider it alongside the Var statistics?
> For example, we could keep the 'unique set of columns' knowledge in
> such a structure for the Aggregate node. It could also be a solution to
> the problem of counting NULLs generated by the RHS of OUTER JOINs in
> the query tree.
> What's more, look at the query:
>
> CREATE TABLE gu_2 (x real);
> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
> VACUUM ANALYZE gu_2;
>
>   HashAggregate  (cost=20.91..22.35 rows=144 width=4)
>                  (actual rows=50 loops=1)
>     Group Key: gu_2.x
>     Batches: 1  Memory Usage: 40kB
>     ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4)
>         (actual rows=50 loops=1)
>           Group Key: gu_2.x
>           Batches: 1  Memory Usage: 40kB
>           ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4)
>              (actual rows=149 loops=1)
>                 Filter: ((x < '50'::double precision) OR (x IS NULL))
>                 Rows Removed by Filter: 951
>
> Here we could also count the number of scanned NULLs separately in
> vardata_extra and use it in the upper GROUP BY estimation.

What would the type of vardata_extra be?  And what information would
it store?  It still seems too sketchy for me to understand.
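
Just to make the discussion concrete, here is a purely hypothetical
sketch of what such a structure might carry (the type and field names
are invented for illustration, not taken from any patch):

    /* Hypothetical sketch only; not from the patch under discussion. */
    typedef struct VariableStatDataExtra
    {
        bool    uniquecols;     /* does a node below prove the grouped
                                 * column set is unique? */
        double  nullrows;       /* NULLs produced below this node, e.g.
                                 * by an IS NULL filter or the nullable
                                 * side of an outer join */
    } VariableStatDataExtra;

An upper GROUP BY estimate could then start from the Var's base-table
statistics and adjust the NULL fraction using nullrows, instead of
assuming the base table's stanullfrac still holds at that plan level.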

But I think for now we should go with the original patch.  It seems to
be a quite straightforward extension of what 4767bc8ff2 does.  I've
revised the commit message and run pgindent over the sources.  I'm
going to push this if there are no objections.
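
For the archives, a minimal script of the kind of case the patch
targets (a unique column whose statistics carry a nonzero stanullfrac;
the table and index names are mine, and the expected estimate below
follows from the formula rather than from pasted output):

    CREATE TABLE gu_1 (x integer);
    CREATE UNIQUE INDEX gu_1_x ON gu_1 (x);  -- NULLs don't conflict in a unique index
    INSERT INTO gu_1 SELECT gs FROM generate_series(1, 900) AS gs;
    INSERT INTO gu_1 SELECT NULL FROM generate_series(1, 100) AS gs;
    VACUUM ANALYZE gu_1;
    EXPLAIN SELECT x FROM gu_1 GROUP BY x;

With stanullfrac = 0.1, the patched estimate should be about
-1.0 * (1.0 - 0.1) * 1000 = 900 groups rather than 1000.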

------
Regards,
Alexander Korotkov
Supabase

Attachment: v2-0001-Improve-statistics-estimation-for-single-column-G.patch
Description: Binary data
