On Tue, Jun 16, 2020 at 7:26 PM amul sul <[email protected]> wrote:
>
> Hi,
>
> The attached patch proposes the $Subject feature, which forces the system
> into read-only mode, where inserting write-ahead log is prohibited until
> ALTER SYSTEM READ WRITE is executed.
>
> The high-level goal is to make the availability/scale-out situation
> better. The feature will help HA setups where the master server needs to
> stop accepting WAL writes immediately and kick out any transaction
> expecting WAL writes at the end, in case of a network outage on the master
> or replication connection failures.
>
> For example, this feature allows for a controlled switchover without
> needing to shut down the master. You can instead make the master
> read-only, wait until the standby catches up, and then promote the
> standby. The master remains available for read queries throughout, and
> also for WAL streaming, but without the possibility of any new write
> transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind.
> (Eventually, it would be nice to be able to make the read-only master into
> a standby without having to restart it, but that is a problem for another
> patch.)
>
> This might also help in failover scenarios. For example, if you detect
> that the master has lost network connectivity to the standby, you might
> make it read-only after 30 s, and promote the standby after 60 s, so that
> you never have two writable masters at the same time. In this case,
> there's still some split-brain, but it's still better than what we have
> now.
>
> Design:
> ----------
> The proposed feature is built atop the super barrier mechanism (commit[1])
> to coordinate global state changes across all active backends. A backend
> that executes the ALTER SYSTEM READ { ONLY | WRITE } command places a
> request with the checkpointer process to change to the requested WAL
> read/write state, aka the WAL prohibited and WAL permitted states
> respectively. When the checkpointer process sees the WAL prohibit state
> change request, it emits a global barrier and waits until all backends
> that participate in ProcSignal absorb it. Once that is done, the WAL
> read/write state in shared memory and the control file is updated so that
> XLogInsertAllowed() returns accordingly.
>
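To check my reading of the sequence above, here is a minimal standalone sketch. It is not the patch's code: SharedState, NBACKENDS, and all the function names are hypothetical stand-ins, and a real implementation would go through ProcSignal and shared memory rather than a plain struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBACKENDS 4                      /* hypothetical fixed backend count */

/* Hypothetical stand-in for the shared-memory state described above. */
typedef struct
{
    bool          wal_prohibited;        /* published WAL read/write state */
    unsigned long barrier_generation;    /* bumped each time a barrier is emitted */
    unsigned long absorbed[NBACKENDS];   /* last generation each backend absorbed */
} SharedState;

/* Checkpointer side: emit a global barrier for a state change request. */
static unsigned long
emit_barrier(SharedState *s)
{
    return ++s->barrier_generation;
}

/* Backend side: absorb whatever barrier is currently pending. */
static void
absorb_barrier(SharedState *s, size_t backend)
{
    assert(backend < NBACKENDS);
    s->absorbed[backend] = s->barrier_generation;
}

/* True once every backend has absorbed generation "gen" or later. */
static bool
all_absorbed(const SharedState *s, unsigned long gen)
{
    for (size_t i = 0; i < NBACKENDS; i++)
        if (s->absorbed[i] < gen)
            return false;
    return true;
}

/*
 * Checkpointer side: publish the new state only after all backends have
 * absorbed the barrier. Returns false if it still has to wait.
 */
static bool
complete_state_change(SharedState *s, unsigned long gen, bool prohibit)
{
    if (!all_absorbed(s, gen))
        return false;
    s->wal_prohibited = prohibit;        /* the control file update would go here */
    return true;
}
```

In this sketch the state flips only after the last backend has absorbed the barrier, which is how I understand the ordering described above.
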
Do we also prohibit the checkpointer from writing dirty pages and a
checkpoint record? If so, will the checkpointer process write the
current dirty pages and a checkpoint record, or do we skip that as
well?

> If there are open transactions that have acquired an XID, the sessions are
> killed before the barrier is absorbed.
>
What about prepared transactions?

> They can't commit without writing WAL, and they can't abort without
> writing WAL, either, so we must at least abort the transaction. We don't
> necessarily need to kill the session, but it's hard to avoid in all cases
> because (1) if there are subtransactions active, we need to force the
> top-level abort record to be written immediately, but we can't really do
> that while keeping the subtransactions on the transaction stack, and (2)
> if the session is idle, we also need the top-level abort record to be
> written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now,
> we just use FATAL to kill the session; maybe this can be improved in the
> future.
>
> Open transactions that don't have an XID are not killed, but will get an
> ERROR if they try to acquire an XID later, or if they try to write WAL
> without acquiring an XID (e.g. VACUUM).
>
What if vacuum is running on an unlogged relation? Do we allow writes
via vacuum to an unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section
> that will write WAL must be preceded by a call to CheckWALPermitted(),
> AssertWALPermitted(), or AssertWALPermitted_HaveXID(). The latter variants
> are used when we know for certain that inserting WAL here must be OK,
> either because we have an XID (we would have been killed by a change to
> read-only if one had occurred) or for some other reason.
>
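As I understand the rule, the check has to come before the critical section because an ERROR raised inside one is promoted to PANIC. A minimal standalone sketch of that ordering follows. Everything in it is a hypothetical stand-in, not the patch's code: the real CheckWALPermitted() would presumably raise an ERROR rather than return false, and the counters only mimic START_CRIT_SECTION(), XLogInsert(), and END_CRIT_SECTION().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for shared state and per-backend state. */
static bool wal_prohibited = false;  /* true while the system is read-only */
static bool have_xid = false;        /* true once this backend has an XID */
static int  wal_records = 0;         /* counts simulated WAL inserts */
static int  crit_depth = 0;          /* simulated critical section nesting */

/* Stand-in for CheckWALPermitted(): returns false where an ERROR is expected. */
static bool
check_wal_permitted(void)
{
    return !wal_prohibited;
}

/* Stand-in for AssertWALPermitted_HaveXID(): holding an XID implies WAL is OK. */
static void
assert_wal_permitted_have_xid(void)
{
    assert(have_xid && !wal_prohibited);
}

/* Caller without an XID: must check before entering the critical section. */
static bool
write_wal_checked(void)
{
    if (!check_wal_permitted())
        return false;                /* fail cleanly, outside the critical section */

    crit_depth++;                    /* START_CRIT_SECTION() */
    wal_records++;                   /* XLogInsert() */
    crit_depth--;                    /* END_CRIT_SECTION() */
    return true;
}

/* Caller known to hold an XID: an assertion is enough. */
static void
write_wal_with_xid(void)
{
    assert_wal_permitted_have_xid();

    crit_depth++;
    wal_records++;
    crit_depth--;
}
```

With wal_prohibited set, write_wal_checked() refuses without ever entering the critical section, which is the behaviour I would expect for something like VACUUM running without an XID.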
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major
> transition, so we emit a log message for each successful execution of an
> ALTER SYSTEM READ { ONLY | WRITE } command. Also, we have added a new GUC
> system_is_read_only which returns "on" when the system is in WAL
> prohibited state or recovery.
>
> Another part of the patch that is quite uneasy and needs discussion is
> that when we shut down in the read-only state we skip the shutdown
> checkpoint, and at restart, startup recovery is performed first and later
> the read-only state is restored to prohibit further WAL writes,
> irrespective of recovery checkpoint succeed or not. The concern here is
> that if this startup recovery checkpoint wasn't OK, then it will never
> happen even if the system is later put back into read-write mode.
>
I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not"? Do you add a try..catch and skip
any error while performing the recovery checkpoint?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com