On Sun, Apr 3, 2016 at 10:53 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alex Shulgin <alex.shul...@gmail.com> writes:
> > This recalled observation can now also explain to me why in the
> regression
> > you've seen, the short path was not followed: my bet is that stadistinct
> > appeared negative.
> Yes, I think that's right.  The table under consideration had just a few
> live rows (I think 3), so that even though there was only one value in
> the sample, the "if (stats->stadistinct > 0.1 * totalrows)" condition
> succeeded.

Yeah, this part of the logic can be really surprising at times.

> Given that we change the logic in the complex path substantially, the
> > assumptions that lead to the "Take all MVCs" condition above might no
> > longer hold, and I see it as a pretty compelling argument to remove the
> > extra checks, thus keeping the only one: track_cnt == ndistinct.  This
> > should also bring the patch's effect more close to the thread's topic,
> > which is "More stable query plans".
> The reason for checking toowide_cnt is that if it's greater than zero,
> then in fact the track list does NOT include all values seen, and it's
> flat-out wrong to claim that it is an exhaustive set of values.

But do we really state that with the short path?

If there would be only one too wide value, it might be the only thing left
for the histogram in the end and will be discarded anyway, so from the end
result perspective there is no difference.

If there are multiple too wide values, they will be effectively discarded
by the histogram calculation part also, so again no difference from the
perspective of the end result.

The reason for the track_cnt <= num_mcv condition is that if that's not
> true, the track list has to be trimmed to meet the statistics target.
> Again, that's not optional.

Yes, but this check we only need in compute_distinct_stats() and we are
talking about compute_scalar_stats() now where track_cnt is always less
than or equals to num_mcv (again, please see at the bottom of the
thread-starting email), or is my analysis broken on this part?

I think the reasoning for having the stats->stadistinct > 0 test in there
> was that if we'd set it negative, then we think that the set of distinct
> values will grow --- which again implies that the set of values actually
> seen should not be considered exhaustive.

This is actually very neat.  So the idea here as I get it is that if we
have enough distinct values to suspect that more unique ones will be added
later as the table grows (which is a natural tendency with most of the
tables anyway), then at the moment the statistics we produce are going to
be actually used by the planner, it is likely that we no longer cover all
the distinct values by the MCV list, right?

I would *love* to see that documented in code comments at the least.

Of course, with a table as
> small as that regression-test example, we have little evidence to support
> either that conclusion or its opposite.

I think it might be possible to record historical ndistinct values between
the ANALYZE runs and use that as better evidence that the number of
distincts is actually growing rather than basing that decision on that
hard-coded 10% limit rule.  What do you think?

We do not support migration of pg_statistic system table during major
version upgrades (yet), so if we somehow achieve what I've just described,
it might be not a compatibility-breaking change anyway.

It's possible that what we should do to eliminate the sudden change
> of behaviors is to drop the "track list includes all values seen, and all
> will fit" code path entirely, and always go through the track list
> one-at-a-time.

That could also be an option, that I have considered initially.  Now that I
read your explanation of each check, I'm not that sure anymore.

If we do, though, the currently-proposed filter rules aren't going to
> be too satisfactory: if we have a relatively small group of roughly
> equally common MCVs, this logic would reject all of them, which is
> surely not what we want.

Indeed. :-(

> The point of the original logic was to try to decide whether the
> values in the sample are significantly more common than typical values
> in the whole table population.  I think we may have broken that with
> 3d3bf62f3: you can't make any such determination if you consider only
> what's in the sample without trying to estimate what is not in the
> sample.

Speaking of rabbit holes...

I'm out of ideas, unfortunately.  We badly need more eyes/brainpower on
this, which is why I have submitted a talk proposal on this topic to
PGDay.ru this summer in St. Petersburg, fearing that it might be too late
to commit a satisfactory version during the current dev cycle for 9.6, and
in hope to draw at least some attention to it.


Reply via email to