On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
>> > Yes, I am, and I disagree with you.  The current decision on this point
>> > was made ages ago, before autovacuum even existed let alone relied on
>> > the stats for proper functioning.  The tradeoff you're saying you're
>> > okay with is "we'll shut down a few seconds faster, but you're going
>> > to have table bloat problems later because autovacuum won't know it
>> > needs to do anything".  I wonder how many of the complaints we get
>> > about table bloat are a consequence of people not realizing that
>> > "pg_ctl stop -m immediate" is going to cost them.
>>
>> That would be useful information to have, but I bet the answer is "not
>> that many".  Most people don't shut down their database very often;
>> they're looking for continuous uptime.  It looks to me like autovacuum
>> activity causes the statistics files to get refreshed at least once
>> per autovacuum_naptime, which defaults to once a minute, so on the
>> average we're talking about the loss of perhaps 30 seconds worth of
>> statistics.
>
> I think you're misunderstanding how this works.  Losing that file
> doesn't lose just the final 30 seconds worth of data -- it loses
> *everything*, and every counter goes back to zero.  So it's not a few
> parts-per-million, it loses however many millions there were.

OK, that's possible, but I'm not sure.  I think there are two separate
issues here.  One is whether we should nuke the stats file on
recovery, and the other is whether we should force a final write of
the stats file before agreeing to an immediate shutdown.  It seems to
me that the first one affects whether all of the counters go to zero,
and the second affects whether we lose a small amount of data from
just prior to the shutdown.  Right now, we are doing the first, so the
second is a waste.  If we decide to stop doing the first, we can
independently decide whether to also start doing the second.
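
(Just to make concrete what's at stake in the first decision: the
counters autovacuum keys off of are the per-table numbers exposed in
pg_stat_user_tables, so a purely illustrative query like the one below
shows what goes back to zero when the file is nuked.)

    -- Per-table counters that drive autovacuum; after crash recovery
    -- wipes the stats file, these read as zero (and last_autovacuum
    -- as NULL) again.
    SELECT relname, n_dead_tup, n_mod_since_analyze, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;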

>> I also think that you're wildly overestimating the likelihood that
>> writing the stats file will be fast, because (1) anything that
>> involves writing to the disk can be very slow, either because there's
>> a lot of other write activity or because the disk is going bad, the
>> latter being actually a pretty common cause of emergency database
>> shutdowns, (2) the stats files can be quite large if the database
>> system contains hundreds of thousands or even millions of objects,
>> which is not all that infrequent, and (3) pgstat wait timeouts are
>> pretty common, which would not be the case if writing the file was
>> invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741).
>
> Those writes are slow because of the concurrent activity.  If all
> backends just throw their hands in the air, no more writes come from
> them, so the OS is going to finish the writes pretty quickly (or at
> least empty enough of the caches so that the pgstat data fits); so
> neither (1) nor (3) should be terribly serious.  I agree that (2) is a
> problem, but it's not a problem for everyone.

If the operating system buffer cache doesn't contain much dirty data,
then I agree.  But if there is a large backlog of dirty data there, then
it might be quite slow.
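
(As an aside, one rough, indirect signal that such a backlog is already
a problem is pg_stat_bgwriter; this isn't part of anything proposed
here, just a way to eyeball it:)

    -- buffers_backend_fsync > 0 means backends had to issue their own
    -- fsync calls because the fsync request queue was full, which
    -- usually points to heavy write pressure.
    SELECT buffers_backend, buffers_backend_fsync, buffers_checkpoint
    FROM pg_stat_bgwriter;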

>> >> ...  Yeah, it's not good, but neither are the things that prompt
>> >> people to perform an immediate shutdown in the first place.
>> >
>> > Really?  I think many users think an immediate shutdown is just fine.
>>
>> Why would anybody ever perform an immediate shutdown rather than a
>> fast shutdown if a fast shutdown were fast?
>
> A fast shutdown is not all that fast -- it needs to write the whole
> contents of shared buffers down to disk, which may be enormous.
> Millions of times bigger than pgstat data.  So a fast shutdown is
> actually very slow in a large machine.  An immediate shutdown, even if
> it writes pgstat data, is still going to be much smaller in terms of
> what is written.

I agree.  However, in many cases, the major cost of a fast shutdown is
getting the dirty data already in the operating system buffers down to
disk, not in writing out shared_buffers itself.  The latter is
probably a single-digit number of gigabytes, or maybe double-digit.
The former might be a lot more, and the write of the pgstat file may
back up behind it.  I've seen cases where an 8kB buffered write from
Postgres takes tens of seconds to complete because the OS buffer cache
is already saturated with dirty data, and the stats files could easily
be a lot more than that.
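
(Anyone curious how big those files actually are on their own system
can check directly; this assumes the default stats_temp_directory of
pg_stat_tmp, the path is relative to the data directory, and reading it
requires superuser:)

    -- On-disk size of the statistics files the collector would have to
    -- rewrite during shutdown.
    SELECT f AS file,
           (pg_stat_file('pg_stat_tmp/' || f)).size AS bytes
    FROM pg_ls_dir('pg_stat_tmp') AS f;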

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

