On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepi...@gmail.com> wrote:
> Thanks for taking a look!
>
> On 11/25/24 23:45, Heikki Linnakangas wrote:
> > On 24/09/2024 08:08, Andrei Lepikhov wrote:
> >> +     * proves the var is unique for this query. However, we'd better still
> >> +     * believe the null-fraction statistic.
> >>      */
> >>      if (vardata->isunique)
> >>          stadistinct = -1.0 * (1.0 - stanullfrac);
> >
> > I wonder about the "we'd better still believe the null-fraction
> > statistic" part. It makes sense for a unique index, but a DISTINCT or
> > GROUP BY collapses all the NULLs to a single row. So I think there's
> > some more work to be done here.
> IMO, in that particular case it is not an issue: for GROUP BY we set the
> vardata->isunique field and do not recurse into the Var statistics
> inside the subquery, just as DISTINCT already does. So we have
> stanullfrac == 0, which means the optimiser doesn't count the number of
> NULLs. In the case of a UNIQUE index, the optimiser will have the
> stanullfrac statistic and will count NULLs.

This sounds convincing.
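To spell out the arithmetic: a negative stadistinct is later resolved as
a fraction of the row count, roughly as get_variable_numdistinct() in
selfuncs.c does. Below is a standalone sketch (not PostgreSQL source),
with made-up row counts chosen to match the gu_2 example quoted further
down:

/*
 * Standalone sketch, not PostgreSQL source: how a negative stadistinct
 * resolves into an ndistinct estimate, loosely mimicking what
 * get_variable_numdistinct() in selfuncs.c does.  The row counts are
 * made up to match the gu_2 example quoted below.
 */
#include <stdio.h>

int
main(void)
{
	double		ntuples = 1100.0;	/* 1000 distinct values + 100 NULLs */
	double		stanullfrac = 100.0 / 1100.0;	/* fraction of NULL rows */
	double		stadistinct;
	double		ndistinct;

	/* the formula under discussion: unique var, discounted by NULLs */
	stadistinct = -1.0 * (1.0 - stanullfrac);

	/* a negative stadistinct means "this fraction of the rows" */
	ndistinct = -stadistinct * ntuples;

	/*
	 * Prints 1000: the distinct non-NULL values.  Note Heikki's point:
	 * a GROUP BY over this column would actually produce 1001 groups,
	 * because all the NULLs collapse into one extra group.
	 */
	printf("ndistinct = %.0f\n", ndistinct);
	return 0;
}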
> But your question raises another one. Could we add some vardata_extra
> to a node, which would track specific conditions and let upper nodes
> take them into account alongside the Var statistics?
> For example, we could keep the 'unique set of columns' knowledge in
> such a structure for the Aggregate node. It could also be a solution to
> the problem of counting NULLs generated by the RHS of OUTER JOINs in
> the query tree.
> What's more, look at the query:
>
> CREATE TABLE gu_2 (x real);
> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
> VACUUM ANALYZE gu_2;
>
> HashAggregate  (cost=20.91..22.35 rows=144 width=4)
>                (actual rows=50 loops=1)
>    Group Key: gu_2.x
>    Batches: 1  Memory Usage: 40kB
>    ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4)
>                       (actual rows=50 loops=1)
>          Group Key: gu_2.x
>          Batches: 1  Memory Usage: 40kB
>          ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4)
>                                (actual rows=149 loops=1)
>                Filter: ((x < '50'::double precision) OR (x IS NULL))
>                Rows Removed by Filter: 951
>
> Here we could also count the number of scanned NULLs separately in
> vardata_extra and use it in the upper GROUP-BY estimation.

What would the type of vardata_extra be? And what information would it
store? It still seems too sketchy for me to understand, though I sketch
one hypothetical shape in the P.S. below. But I think for now we should
go with the original patch. It seems to be a quite straightforward
extension of what 4767bc8ff2 does. I've revised the commit message and
applied pgindent to the sources. I'm going to push this if there are no
objections.

------
Regards,
Alexander Korotkov
Supabase
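P.S. To make the vardata_extra question above concrete, here is one
purely hypothetical shape such a structure could take. The name and
both fields are mine and nothing like this exists in the tree; it only
illustrates the kind of facts a lower plan node might export for
upper-level estimation.

/*
 * Hypothetical sketch only -- no such structure exists in PostgreSQL.
 * Per-node facts that a lower plan node could export so that upper
 * nodes can refine their estimates.
 */
#include <stdbool.h>

typedef struct VarDataExtra
{
	bool		colsUnique;		/* node proves its column set unique,
								 * e.g. the group keys of an Aggregate */
	double		numNullRows;	/* NULL rows emitted, e.g. by the RHS of
								 * an outer join or by an "x IS NULL"
								 * qual as in the gu_2 plan above */
} VarDataExtra;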
v2-0001-Improve-statistics-estimation-for-single-column-G.patch