(2018/04/13 3:49), Robert Haas wrote:
On Thu, Apr 12, 2018 at 12:49 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
We've been seeing $subject since commit 1bc0100d2, with little clue
as to the cause.  Previous attempts to fix it by preventing autovacuum
from running on the relevant tables didn't seem to help.

I have now managed to reproduce the issue reliably enough to study it.
(It turns out that on longfin's host, running the buildfarm script *under
cron* produces the failure a reasonable percentage of the time.  The
identical test case, launched from an interactive shell, never fails.
I speculate that the kernel scheduler treats cron jobs differently.)
It is in fact a timing issue, and what triggers it is there being an
active autovacuum task in another database.  The fact that the buildfarm
script uses USE_MODULE_DB in the contrib tests greatly increases the
surface area of the problem, since that creates a lot more databases that
autovacuum could be active in.  Once I realized that, I found that it can
trivially be reproduced by hand, in a "make installcheck" test, simply by
having an open transaction in another DB in the cluster.  For instance, run
"select pg_sleep(60);" in that other database, and then run make installcheck
in postgres_fdw.

So now, drilling down to the specific problem: what we're seeing is that
the plans for some queries change because the "remote" server reports
a different rowcount estimate than usual for

EXPLAIN SELECT "C 1", c2, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 
1">  2000)) FOR UPDATE

Normally the estimate is one, but in the failure cases it's 12 or so.
This estimate is coming out of ineq_histogram_selectivity, of course.
Furthermore, the highest value actually present in the histogram is
1000, because that was the highest value of "C 1" when the test did
ANALYZE up at the top.  That means we'll invoke get_actual_variable_range
to try to identify the actual current maximum.  Most of the time, it
returns 2010, which is correct, and we end up estimating that only
about one row will exceed 2000.
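
If you want to see that stale endpoint for yourself, it shows up in pg_stats
(the quoting matters, since the identifiers contain spaces):

    SELECT histogram_bounds
      FROM pg_stats
     WHERE schemaname = 'S 1' AND tablename = 'T 1' AND attname = 'C 1';
    -- the last bound is 1000, so for "C 1" > 2000 the planner runs off the
    -- end of the histogram and calls get_actual_variable_range() to probe
    -- the index for the real maximum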

However, in the failure cases, what's getting extracted from the index
is 9999.  That's because we'd inserted and deleted that value further
up in the test, and if there's been an open transaction holding back
RecentGlobalXmin, then SnapshotNonVacuumable is going to consider that
entry still good.  This value is enough larger than 2000 to move the
selectivity estimate appreciably, and then kaboom.
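
The same effect is easy to reproduce outside the regression test; here is a
rough standalone sketch (table and index names made up here):

    CREATE TABLE probe (c1 int);
    CREATE INDEX ON probe (c1);
    INSERT INTO probe SELECT generate_series(1, 1000);
    ANALYZE probe;                      -- histogram now tops out at 1000

    -- in another session (any database): SELECT pg_sleep(60);
    -- its snapshot holds back RecentGlobalXmin

    INSERT INTO probe VALUES (9999);
    DELETE FROM probe WHERE c1 = 9999;  -- dead, but not yet vacuumable
    INSERT INTO probe SELECT generate_series(1001, 2010);

    EXPLAIN SELECT * FROM probe WHERE c1 > 2000;
    -- get_actual_variable_range() scans the index with SnapshotNonVacuumable,
    -- still sees 9999 as the column maximum, and the estimate balloons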

This theory successfully explains the observed fact that some buildfarm
failures show differences in two query plans, while others show a
difference just in the first one.  In the latter cases, the external
transaction ended, so that RecentGlobalXmin could advance, in between.

What I propose to do to fix the instability is to change the test
stanza that uses 9999 as a key chosen at random to use something less
far away from the normal range of "C 1" values, so that whether it's
still visible to get_actual_variable_range has less effect on this
selectivity estimate.  That's a hack, for sure, but I don't see any
other fix that's much less of a hack.
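
In other words, something along these lines (paraphrased, not the verbatim
test stanza, and the replacement value is only illustrative):

    -- before: the throwaway key sits far outside the live "C 1" range
    --   INSERT INTO ft2 (c1, c2, c3) VALUES (9999, 999, 'foo');
    --   ...
    --   DELETE FROM ft2 WHERE c1 = 9999;
    -- after: keep it close to the existing 1..2010 range, e.g.
    INSERT INTO ft2 (c1, c2, c3) VALUES (1200, 999, 'foo');
    -- ...
    DELETE FROM ft2 WHERE c1 = 1200;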

Thanks for the detective work.

Thanks a lot!

Best regards,
Etsuro Fujita
