Alex Shulgin <alex.shul...@gmail.com> writes:
> This recalled observation can now also explain to me why in the regression
> you've seen, the short path was not followed: my bet is that stadistinct
> appeared negative.

Yes, I think that's right.  The table under consideration had just a few
live rows (I think 3), so that even though there was only one value in
the sample, the "if (stats->stadistinct > 0.1 * totalrows)" condition
succeeded.

> Given that we change the logic in the complex path substantially, the
> assumptions that lead to the "Take all MVCs" condition above might no
> longer hold, and I see it as a pretty compelling argument to remove the
> extra checks, thus keeping the only one: track_cnt == ndistinct.  This
> should also bring the patch's effect more close to the thread's topic,
> which is "More stable query plans".

The reason for checking toowide_cnt is that if it's greater than zero,
then in fact the track list does NOT include all values seen, and it's
flat-out wrong to claim that it is an exhaustive set of values.

The reason for the track_cnt <= num_mcv condition is that if that's not
true, the track list has to be trimmed to meet the statistics target.
Again, that's not optional.

I think the reasoning for having the stats->stadistinct > 0 test in there
was that if we'd set it negative, then we think that the set of distinct
values will grow --- which again implies that the set of values actually
seen should not be considered exhaustive.  Of course, with a table as
small as that regression-test example, we have little evidence to support
either that conclusion or its opposite.

It's possible that what we should do to eliminate the sudden change
of behaviors is to drop the "track list includes all values seen, and all
will fit" code path entirely, and always go through the track list
one-at-a-time.

If we do, though, the currently-proposed filter rules aren't going to
be too satisfactory: if we have a relatively small group of roughly
equally common MCVs, this logic would reject all of them, which is
surely not what we want.

The point of the original logic was to try to decide whether the
values in the sample are significantly more common than typical values
in the whole table population.  I think we may have broken that with
3d3bf62f3: you can't make any such determination if you consider only
what's in the sample without trying to estimate what is not in the
sample.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to