I've had a vague feeling for awhile now that the occasional buildfarm
failures we see in the stats regression test (where it appears that the
stats collector fails to respond to requests for no good reason) might
be related to operating-system timekeeping glitches.  Today there is
finally indisputable evidence of this happening, in this log file:
Observe the following log excerpt:

[4fdcaca3.e74:478] LOG:  statement: CREATE INDEX dupindexcols_i ON dupindexcols 
(f1, id, f1 text_pattern_ops);
[4fdcaca3.e74:479] LOG:  statement: ANALYZE dupindexcols;
[4fdcaca3.e74:480] LOG:  statement: EXPLAIN (COSTS OFF)
          SELECT count(*) FROM dupindexcols
            WHERE f1 > 'WA' and id < 1000 and f1 ~<~ 'YX';
[4fdcaca3.e74:481] LOG:  statement: SELECT count(*) FROM dupindexcols
          WHERE f1 > 'WA' and id < 1000 and f1 ~<~ 'YX';
[4fdcaca3.e74:482] LOG:  disconnection: session time: 0:00:00.000 user=pgrunner 
database=regression host=::1 port=1123
[4fdcab40.e04:1] LOG:  connection received: host=::1 port=1125
[4fdcab40.e04:2] LOG:  connection authorized: user=pgrunner database=regression
[4fdcab40.e04:3] LOG:  statement: CREATE TABLE a (aa TEXT);
[4fdcab40.444:1] LOG:  connection received: host=::1 port=1126
[4fdcab40.e88:1] LOG:  connection received: host=::1 port=1127
[4fdcab40.444:2] LOG:  connection authorized: user=pgrunner database=regression
[4fdcab40.444:3] LOG:  statement: CREATE USER regtest_unpriv_user;
[4fdcab40.444:4] LOG:  statement: CREATE SCHEMA temp_func_test;
[4fdcab40.e04:4] LOG:  statement: CREATE TABLE b (bb TEXT) INHERITS (a);

We can tell from the statements being executed that session 4fdcaca3.e74
is running the create_index regression test, while 4fdcab40.444 is
running create_function_3, and the other sessions starting concurrently
with it are part of the parallel group that runs after create_index.
So they certainly didn't start until 4fdcaca3.e74 finished.  Also
create_index is not exactly a zero-time test, so the fact that
4fdcaca3.e74 reported a session runtime of 0:00:00.000 should already
set off some alarm bells.  But remember that in the %c log_line_prefix
escape, the first part is the process's start-time timestamp in hex.
So 4fdcaca3 means Sat Jun 16 2012, 11:56:19 EDT while 4fdcab40 means
Sat Jun 16 2012, 11:50:24 EDT, nearly six minutes *earlier*.

The reported zero session time is explainable by the fact that
TimestampDifference returns zeros if the given timestamps are out of
order.  All the other process timestamps in the log are likewise
consistent with the theory that the system's clock went backwards
six-plus minutes while create_index was running.

Then, when we finally get to the stats regression test, it fails with
symptoms indicating that the stats collector never answered the
backend's requests for an updated stats file; and when the postmaster
finally shuts down and tells the stats collector to shut down too, the
final pgstat_write_statsfile call bleats like so:
[4fdcac54.a0c:1] LOG:  last_statrequest 2012-06-16 11:55:20.813625-04 is later 
than collector's time 2012-06-16 11:54:12.392375-04

Once you know that the system clock glitched like that, it's fairly
obvious what happened inside the stats collector: it wrote out the stats
file (probably in response to an autovacuum request) sometime shortly
after 11:55:20, and then the system clock went backwards to around
11:50, and so all subsequent inquiry messages had request timestamps
older than last_statwrite causing the collector to believe it needn't
emit fresh stats files.

Aside from breaking the stats regression test, a scenario like this one
has bad implications for autovacuum: for something like six minutes,
autovacuum would have been operating with stale pgstat data since the
stats file would not get rewritten, and backends would see the file
timestamp as being ahead of their own clocks so they wouldn't think it
was stale.  (So this scenario doesn't explain the occasional "pgstat
wait timeout" failures we see; though perhaps someone can think of a
variant that fits that symptom?)

I will refrain from disparaging Windows here and just note that this
would be an unsurprising occurrence on any machine not running NTP: the
owner might occasionally fix the clock once it had drifted far enough
from reality.  So I think it might be advisable to install some defenses
against the case rather than assuming it can't happen.

The most direct fix would involve checking GetCurrentTimestamp against
last_statswrite in the stats collector's main loop, but on a machine
with slow gettimeofday calls that would be pretty bad for performance.
What I suggest instead is:

(1) In backend_read_statsfile, make an initial attempt to read the stats
file and then read GetCurrentTimestamp after that.  If the local clock
reading is less than the stats file's timestamp, we know that some sort
of clock skew or glitch has happened, so force an inquiry message to be
sent with the local timestamp.  But then accept the stats file anyway,
since the skew might be small and harmless.  The reason for the forced
inquiry message is to cause (2) to happen at the collector.

(2) In pgstat_recv_inquiry, if the received inquiry_time is older than
last_statwrite, we should suspect a clock glitch (though it might just
indicate delayed message receipt).  In this case, do a fresh
GetCurrentTimestamp call, and if the reading is less than
last_statwrite, we know that the collector's time went backwards.
To recover, reset these variables as we do at startup:
        last_statrequest = GetCurrentTimestamp();
        last_statwrite = last_statrequest - 1;
to force an immediate write to happen with the "new" local time.

(1) is basically free in terms of the amount of work done in non-broken
cases, though it will require a few more lines of code.  (2) means
adding some GetCurrentTimestamp calls that did not occur before, but
hopefully these will be infrequent, since in the absence of clock
glitches they would only happen when a backend's demand for a new stats
file is generated before the collector starts to write a new stats file
but not received till afterwards.

Comments?  Anyone see a flaw in this design?  Or want to argue that
we shouldn't do anything about such cases?

                        regards, tom lane

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to