pgsql-hackers,

So I’ve put some time into a design for the incremental checksum feature and 
wanted to get some feedback from the group:

* Incremental Checksums

PostgreSQL users should have a way of upgrading their cluster to use data 
checksums without having to do a costly pg_dump/pg_restore; in particular, 
checksums should be able to be enabled and disabled at will, with the database 
enforcing the logic of whether the pages in a given database are considered 
valid.

One approach considered was adding flags to pg_upgrade to set up the new 
cluster to use checksums where the old one did not (or, optionally, to turn 
them off).  That is a nice tool to have, but it does not by itself keep the 
database online while it goes through the initial checksumming process, which 
is what this design aims to support.

In order to support the idea of incremental checksums, this design adds the 
following things:

** pg_control:

Keep "data_checksum_version", but have it indicate *only* the checksum 
algorithm version; i.e., it is no longer used for the data_checksum 
enabled/disabled state.

Add "data_checksum_state", an enum with multiple states: "disabled", 
"enabling", and "enforcing" (perhaps "revalidating" too; something to indicate 
that we are reprocessing a database that purports to have been completely 
checksummed already).

An explanation of the states, and of checksum behavior in each:

- disabled => not in a checksum cycle; no read validation, no checksums 
written.  This is the current behavior for Postgres *without* checksums.

- enabling => in a checksum cycle; no read validation, write checksums.  Any 
page that gets written to disk will get a valid checksum.  This state is 
required when transitioning a cluster which has never had checksums, as reads 
would otherwise fail validation, the checksum fields being uninitialized.

- enforcing => not in a checksum cycle; read validation, write checksums.  This 
is the current behavior of Postgres *with* checksums.

(Caveat: I'm not certain the following state is needed, and the current 
version of this patch doesn't have it.)

- revalidating => in a checksum cycle; read validation, write checksums.  The 
difference between this and "enabling" is that here we care if page reads fail 
validation: by definition the pages should already have valid checksums, and 
we want to verify that.
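The four states above boil down to three independent behaviors.  A minimal 
sketch in C (the enum and function names are illustrative only, not the 
actual patch code):

```c
#include <stdbool.h>

/* Hypothetical encoding of the proposed pg_control field. */
typedef enum DataChecksumState
{
    DATA_CHECKSUMS_DISABLED,     /* no validation, no checksums written */
    DATA_CHECKSUMS_ENABLING,     /* cycle running: write but don't validate */
    DATA_CHECKSUMS_ENFORCING,    /* steady state: validate and write */
    DATA_CHECKSUMS_REVALIDATING  /* cycle running: validate and write */
} DataChecksumState;

/* Should a page's checksum be verified when it is read in? */
static bool
checksum_validate_on_read(DataChecksumState state)
{
    return state == DATA_CHECKSUMS_ENFORCING ||
           state == DATA_CHECKSUMS_REVALIDATING;
}

/* Should a checksum be computed when a page is flushed to disk? */
static bool
checksum_write_on_flush(DataChecksumState state)
{
    return state != DATA_CHECKSUMS_DISABLED;
}

/* Is a checksum cycle currently in progress (bgworker has work to do)? */
static bool
checksum_cycle_active(DataChecksumState state)
{
    return state == DATA_CHECKSUMS_ENABLING ||
           state == DATA_CHECKSUMS_REVALIDATING;
}
```

Note that "enabling" is the only checksum-writing state that skips read 
validation, which is exactly what makes the online transition safe.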

Add "data_checksum_cycle", a counter that gets incremented with every checksum 
cycle change.  This is used to track when new checksum actions take place, for 
instance if we want to upgrade/change the checksum algorithm, or if we just 
want to support periodic checksum validation.

This variable will be compared against new values in the system tables to keep 
track of which relations still need to be checksummed in the cluster.

** pg_database:

Add a field "datlastchecksum" which will be the last checksum cycle which has 
completed for all relations in that database.

** pg_class:

Add a field "rellastchecksum" which stores the last successful checksum cycle 
for each relation.

** The checksum bgworker:

Something needs to proactively checksum any relations which still need to be 
processed, and that something is the checksum bgworker.  It will operate 
similarly to the autovacuum daemons; in fact, in this initial pass we'll hook 
into the autovacuum launcher, due to similarities in catalog-reading 
functionality and to balance this work against other maintenance activity.

If autovacuum does not need to do any vacuuming work, it will check whether 
the cluster has requested a checksum cycle, i.e., whether the state is 
"enabling" (or "revalidating").  If so, it will look for any database which 
needs a checksum update: it reads the current value of the data_checksum_cycle 
counter and looks for databases with "datlastchecksum < data_checksum_cycle".

When all databases have "datlastchecksum" == data_checksum_cycle, we initiate 
checksumming of any global (cluster-wide) heap files.  Once those have been 
checksummed, the checksum cycle is complete: we change pg_control's 
"data_checksum_state" to "enforcing" and consider things fully up-to-date.

If it finds a database needing work, it iterates through that database's 
relations looking for "rellastchecksum < data_checksum_cycle".  If it finds 
none (i.e., every record has rellastchecksum == data_checksum_cycle) then it 
marks the containing database as up-to-date by updating "datlastchecksum = 
data_checksum_cycle".

For any relation it finds in the database which is not yet checksummed, it 
starts an actual worker to handle the checksum process for that table.  Since 
the state of the cluster is already either "enabling" or "revalidating", any 
block writes will get checksums added automatically, so the only thing the 
bgworker needs to do is load each block in the relation and explicitly mark it 
dirty (unless that's not required for FlushBuffer() to do its thing).  After 
every block in the relation has been visited this way and checksummed, its 
pg_class record gets "rellastchecksum" updated.
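The relation-level scan described above can be sketched as follows (the 
struct is a hypothetical stand-in for a pg_class row; only the proposed 
field matters here):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a pg_class row with the proposed new field. */
typedef struct RelEntry
{
    uint32_t rellastchecksum;   /* last completed checksum cycle */
} RelEntry;

/*
 * Return the index of the next relation still needing a checksum pass for
 * the given cycle, or -1 if every relation is caught up -- in which case
 * the caller would mark the containing database as up-to-date by setting
 * datlastchecksum = cycle.
 */
static int
next_relation_to_checksum(const RelEntry *rels, size_t nrels, uint32_t cycle)
{
    for (size_t i = 0; i < nrels; i++)
    {
        if (rels[i].rellastchecksum < cycle)
            return (int) i;
    }
    return -1;
}
```

The same "last seen counter < current counter" comparison drives the 
database-level scan against "datlastchecksum".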

** Function API:

Interface to the functionality will be via the following Utility functions:

  - pg_enable_checksums(void) => turn checksums on for a cluster.  Will error 
if the state is anything but "disabled".  If this is the first time the 
cluster has enabled checksums, this will initialize 
ControlFile->data_checksum_version to the preferred built-in algorithm (since 
there's only one currently, we just set it to 1).  It then increments the 
ControlFile->data_checksum_cycle variable and sets the state to "enabling", 
which means that the next time the bgworker checks if there is anything to do, 
it will see that state, scan all the databases' "datlastchecksum" fields, and 
start kicking off bgworker processes to handle the checksumming of the actual 
relation files.

  - pg_disable_checksums(void) => turn checksums off for a cluster.  Sets the 
state to "disabled", which means the bgworker will not do anything.

  - pg_request_checksum_cycle(void) => if checksums are enabled, increment the 
data_checksum_cycle counter and set the state to "enabling".  (Alternately, if 
we used the "revalidating" state here, we could ensure that existing checksums 
are validated on read, alerting us to any blocks with problems.  This could 
also be made "smart", i.e., interrupt an existing running checksum cycle to 
kick off another one (not sure of the use case), effectively call 
pg_enable_checksums() if checksums have never been explicitly enabled on the 
cluster, etc.; it depends on how pedantic we want to be.)
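As a sketch of the state transitions these functions perform (all names here 
are illustrative stand-ins for the pg_control fields, and errors are modeled 
as a false return rather than an ereport):

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { CS_DISABLED, CS_ENABLING, CS_ENFORCING } ChecksumState;

/* Hypothetical slice of pg_control. */
typedef struct ControlSketch
{
    ChecksumState data_checksum_state;
    uint32_t      data_checksum_cycle;
    uint32_t      data_checksum_version;  /* 0 = never enabled */
} ControlSketch;

/* pg_enable_checksums(): only legal from the "disabled" state. */
static bool
enable_checksums(ControlSketch *cf)
{
    if (cf->data_checksum_state != CS_DISABLED)
        return false;
    if (cf->data_checksum_version == 0)
        cf->data_checksum_version = 1;   /* only one built-in algorithm */
    cf->data_checksum_cycle++;
    cf->data_checksum_state = CS_ENABLING;
    return true;
}

/* pg_request_checksum_cycle(): a no-op while checksums are disabled. */
static bool
request_checksum_cycle(ControlSketch *cf)
{
    if (cf->data_checksum_state == CS_DISABLED)
        return false;
    cf->data_checksum_cycle++;
    cf->data_checksum_state = CS_ENABLING;
    return true;
}
```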

** Design notes/implications:

When the system is in one of the modes which write checksums (currently 
everything but "disabled"), any new relations/databases will have their 
"rellastchecksum"/"datlastchecksum" counters prepopulated with the current 
value of "data_checksum_cycle", as we know that any space used for these 
relations will be checksummed, and hence valid.  By pre-setting this, we 
remove the need for the checksum bgworker to explicitly visit these new 
relations and force checksums which would already be valid.
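The pre-population rule is simple enough to state as a one-liner (a 
hypothetical helper that would run at relation creation time, not patch code):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * If the cluster is in any checksum-writing state, a brand-new relation is
 * stamped as already covered by the current cycle: every one of its pages
 * will be written (and therefore checksummed) before it can ever be read
 * back, so the bgworker never needs to visit it.
 */
static uint32_t
initial_rellastchecksum(bool checksums_being_written, uint32_t current_cycle)
{
    return checksums_being_written ? current_cycle : 0;
}
```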

With checksums on, we know any full-heap-rewriting operations will be properly 
checksummed, so we may be able to pre-set rellastchecksum for operations such 
as ALTER TABLEs which trigger a full rewrite, *without* having the checksum 
bgworker explicitly run on the relation.  I suspect there are a number of 
other places which lend themselves to this kind of optimization.  (Say, if we 
were somehow able to force a checksum operation as part of any full SeqScan 
and update the state after the fact, we'd avoid paying the penalty a second 
time.)

** pg_upgrade:

Milestone 2 in this patch is adding support for pg_upgrade.

With this additional complexity, we need to consider pg_upgrade, both now and 
in future versions.  For one thing, we need to transfer settings from 
pg_control, plus make sure that pg_upgrade tolerates differences in any of the 
data_checksum-related settings in pg_control.

There are four scenarios to consider for if/what to allow:

*** non-checksummed -> non-checksummed

exactly as it stands now

*** checksummed -> non-checksummed

Pretty trivial; since the system tables will be non-checksummed, this is just 
equivalent to resetting the checksum cycle and pg_control fields.  User data 
files will be copied or linked into place with their checksums, but since 
checksums are disabled they will be ignored.

*** non-checksummed -> checksummed

For the major version this patch makes it into, this will likely be the 
primary use case: add an --enable-checksums option to `pg_upgrade` to 
initially set the new cluster's state to "enabling" and pre-init the system 
databases with the correct state and checksum cycle flag.

*** checksummed -> checksummed

The potentially tricky case (but likely to become more common going forward as 
incremental checksums are supported).

Since the old cluster may have had a checksum cycle in progress, or otherwise 
had a nonzero checksum counter, we need to do the following:

- We need to propagate data_checksum_state, data_checksum_cycle, and 
data_checksum_version.  If we wanted to support a different CRC algorithm, we 
could pre-set data_checksum_version to a different version here, increment 
data_checksum_cycle, and set data_checksum_state to either "enabling" or 
"revalidating", depending on the original state from the old cluster (i.e., on 
whether we were in the middle of an initial checksum cycle, state == 
"enabling").

- The new cluster's system tables may need to have the "rellastchecksum" and 
"datlastchecksum" settings carried over from the previous system, if that's 
easy, to avoid a fresh checksum run when there is no need.

** Handling checksums on a standby:

How to handle checksums on a standby is a bit trickier, since checksums are 
inherently local cluster state and not WAL-logged, yet we are storing state in 
the system tables for each database.

In order to manage this discrepancy, we WAL-log a few additional pieces of 
information; specifically:

- new events to capture/propagate changes to the pg_control fields: checksum 
version data, checksum cycle increases, and enabling/disabling actions

- checksum background worker block ranges.

Some notes on the block ranges: this would effectively be a series of records 
containing (datid, relid, start block, end block) for explicit checksum 
ranges, generated by the checksum bgworker as it checksums individual 
relations.  Rather than having the granularity be per relation, these records 
could be generated periodically (say, in groups of 10K blocks or whatever; 
number to be determined) so that standby checksum recalculation is incremental 
and does not delay replay unnecessarily while checksums are being created.
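A sketch of what such a record's payload might look like (a hypothetical 
struct; the real record would of course carry the standard XLog headers as 
well):

```c
#include <stdint.h>

/* Hypothetical WAL record payload for one checksummed block range. */
typedef struct xl_checksum_block_range
{
    uint32_t datid;        /* database OID */
    uint32_t relid;        /* relation OID */
    uint32_t start_block;  /* first block covered */
    uint32_t end_block;    /* last block covered, inclusive */
} xl_checksum_block_range;

/*
 * Number of blocks one record covers; emitting a record every N blocks
 * (e.g. N = 10000) is what keeps standby replay incremental.
 */
static uint32_t
range_nblocks(const xl_checksum_block_range *r)
{
    return r->end_block - r->start_block + 1;
}
```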

Since the block range WAL records will be replayed before any of the 
pg_class/pg_database catalog records are replayed, we are guaranteed to have 
the checksums calculated on the standby by the time the system state says they 
are valid.

We may also be able to use the WAL records to speed up reprocessing of 
existing heap files if a cycle is interrupted for some reason; that remains to 
be seen.

** Testing changes:

We need to add separate initdb checksum regression tests which live outside 
the normal pg_regress framework.

** Roadmap:

- Milestone 1 (master support) [0/7]
  - [ ] pg_control updates for new data_checksum_cycle, data_checksum_state
  - [ ] pg_class changes
  - [ ] pg_database changes
  - [ ] function API
  - [ ] autovac launcher modifications
  - [ ] checksum bgworker
  - [ ] doc updates
- Milestone 2 (pg_upgrade support) [0/4]
  - [ ] no checksum -> no checksum
  - [ ] checksum -> no checksum
  - [ ] no checksum -> checksum
  - [ ] checksum -> checksum
- Milestone 3 (standby support) [0/4]
  - [ ] WAL log checksum cycles
  - [ ] WAL log enabling/disabling checksums
  - [ ] WAL log checksum block ranges
  - [ ] Add standby WAL replay

I look forward to any feedback; thanks!

David
--
David Christensen
PostgreSQL Team Manager
End Point Corporation
da...@endpoint.com
785-727-1171