On 12/23/21 15:42, Fujii Masao wrote:
On 2021/12/23 3:49, Tomas Vondra wrote:
Attached is a patch tweaking WAL logging - in wal_level=minimal we do
the same thing as now, in higher levels we log every sequence fetch.
Thanks for the patch!
With the patch, I found that the regression test for sequences failed.
+ fetch = log = fetch;
This should be "log = fetch"?
On second thought, originally a sequence doesn't guarantee that the
value already returned by nextval() will never be returned by subsequent
nextval() after the server crash recovery. That is, nextval() may return
the same value across crash recovery. Is this understanding right? For
example, this case can happen if the server crashes after nextval()
returned the value but before WAL for the sequence was flushed to the
permanent storage.
I think the important step is commit. We don't guarantee anything for
changes in uncommitted transactions. If you do nextval in a transaction
and the server crashes before the WAL gets flushed before COMMIT, then
yes, nextval may generate the same nextval again. But after commit that
is not OK - it must not happen.
So it's not a bug that sync standby may return the same value as
already returned in the primary because the corresponding WAL has not
been replicated yet, isn't it?
No, I don't think so. Once the COMMIT happens (and gets confirmed by the
sync standby), it should be possible to failover to the sync replica
without losing any data in committed transaction. Generating duplicate
values is a clear violation of that.
IMHO the fact that we allow a transaction to commit (even just locally)
without flushing all the WAL it depends on is clearly a data loss bug.
BTW, if the returned value is stored in database, the same value is
guaranteed not to be returned again after the server crash or by sync
standby. Because in that case the WAL of the transaction storing that
value is flushed and replicated.
True, assuming the table is WAL-logged etc. I agree the issue may be
affecting a fairly small fraction of workloads, because most people use
sequences to generate data for inserts etc.
So I think this makes it acceptable / manageable. Of course, this
means the values are much less monotonous (across backends), but I
don't think we really promised that. And I doubt anyone is really
using sequences like this (just nextval) in performance critical use
cases.
I think that this approach is not acceptable to some users. So, if we
actually adopt WAL-logging every sequence fetch, also how about exposing
SEQ_LOG_VALS as reloption for a sequence? If so, those who want to log
every sequence fetch can set this SEQ_LOG_VALS reloption to 0. OTOH,
those who prefer the current behavior in spite of the risk we're
discussing at this thread can set the reloption to 32 like it is for
now, for example.
I think it'd be worth explaining why you think it's not acceptable?
I've demonstrated the impact on regular workloads (with other changes
that write stuff to WAL) is not measurable, and enabling sequence
caching eliminates most of the overhead for the rare corner case
workloads if needed. It does generate a bit more WAL, but the sequence
WAL records are pretty tiny.
I'm opposed to adding relooptions that affect correctness - it just
seems like a bad idea to me. Moreover setting the CACHE for a sequence
does almost the same thing - if you set CACHE 32, we only generate WAL
once every 32 increments. The only difference is that this cache is not
shared between backends, so one backend will generate 1,2,3,... and
another backend will generate 33,34,35,... etc. I don't think that's a
problem, because if you want strictly monotonous / gap-less sequences
you can't use our sequences anyway. Yes, with short-lived backends this
may consume the sequences faster, but well - short-lived backends are
expensive anyway and overflowing bigserial is still unlikely.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company