On 07/18/2012 06:56 AM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> BTW, while we are on the subject: hasn't this split completely broken
>>> the statistics about backend-initiated writes?
>> Yes, it seems to have done just that.
>
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper.  The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable.  So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.
>
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01)

That makes me wonder whether, on top of the buildfarm, we need a "crashfarm": some buildfarm machines extended to do automated crash-recovery testing, along these lines (a sketch of the controller loop follows the list):

- Keep KVM instances with copy-on-write snapshot disks and the build environment on them.
- Fire up the VM, do a build, and start the server.
- From outside the VM, have the test controller connect to the server and start a test run.
- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up.
- Start Pg back up and connect to it again.
- From the test controller, test the Pg install for possible corruption by reading the indexes and tables, doing some test UPDATEs, etc.
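Purely as an illustration of what I have in mind for the controller side, something like the following. Everything in it is hypothetical: the libvirt/KVM guest name "pg-crashtest", the shut-off disk snapshot "clean-build", the guest address, and the workload.sql / checks.sql scripts are all made up; only virsh and psql themselves are real.

    #!/usr/bin/env python
    # Very rough sketch of the crashfarm controller loop.
    import random, subprocess, time

    GUEST_IP = "192.168.122.50"          # made-up guest address

    def virsh(*args):
        subprocess.check_call(["virsh"] + list(args))

    def psql(*args):
        # Talk to the guest's postmaster from the controller host.
        return subprocess.call(["psql", "-h", GUEST_IP, "-U", "postgres"]
                               + list(args))

    for run in range(100):
        virsh("snapshot-revert", "pg-crashtest", "clean-build")  # fresh disks
        virsh("start", "pg-crashtest")
        time.sleep(30)                            # wait for boot + postmaster

        # Kick off a write-heavy workload (hypothetical script).
        load = subprocess.Popen(["psql", "-h", GUEST_IP, "-U", "postgres",
                                 "-f", "workload.sql"])

        time.sleep(random.uniform(5, 120))        # run for a random interval
        virsh("destroy", "pg-crashtest")          # hard power-off, no shutdown
        load.kill()                               # the client has died anyway

        virsh("start", "pg-crashtest")            # power the OS back on
        time.sleep(30)                            # crash recovery happens here

        # Can we connect, and do the consistency checks pass?
        if psql("-c", "SELECT 1") != 0 or psql("-f", "checks.sql") != 0:
            print("run %d: failed recovery or possible corruption" % run)
            break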

The main challenge would be coming up with suitable tests to run - ones whose results could be checked afterwards to make sure nothing was broken. The test controller would know which test it was running and how far it got before the OS was killed, so given appropriate test metadata it could check for the expected data. The enable_* planner settings (enable_seqscan, enable_indexscan, etc.) should make it possible to force scans of both the indexes and the table heaps.
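For the checks themselves, something along these lines is what I'm thinking of. The table, its "id" index, the payload column, and the guest address are invented for illustration; real names and expected values would come from the test metadata:

    # Illustrative consistency check: force both access paths on a test table
    # and make sure the heap and its index still agree.
    import subprocess

    GUEST_IP = "192.168.122.50"    # hypothetical guest address

    def query(sql):
        # Feed SQL to psql on stdin; -q hides command tags, -At gives bare rows.
        return subprocess.check_output(
            ["psql", "-h", GUEST_IP, "-U", "postgres", "-qAt"],
            input=sql, universal_newlines=True).strip()

    # Count rows via a forced sequential scan of the heap ...
    seq = query("SET enable_indexscan = off; SET enable_bitmapscan = off;\n"
                "SELECT count(*) FROM crashtest_table;")

    # ... and again via a forced index scan (assumes an index on id).
    idx = query("SET enable_seqscan = off;\n"
                "SELECT count(*) FROM crashtest_table WHERE id >= 0;")

    if seq != idx:
        raise SystemExit("heap and index disagree: %s vs %s" % (seq, idx))

    # A few test UPDATEs to confirm the cluster is writable after recovery.
    query("UPDATE crashtest_table SET payload = payload WHERE id % 1000 = 0;")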

What else should be checked? The main thing that comes to mind for me is something I've worried about for a while: that Pg might not handle running out of disk space anywhere near as gracefully as is often claimed. There's no automated testing for that, so it's hard to really know. A harnessed VM could be used to test it: instead of virtual plug-pull tests, the harness would generate a virtual disk of constrained random size, run its tests until running out of space caused a failure, stop Pg, expand the disk, restart Pg, and run its checks.

Variants where WAL is on a separate disk, and where only the WAL disk or only the main non-WAL disk runs out of space, would also make sense and would be easy to produce with such a harness.
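As a rough illustration of the disk-full variant: qemu-img and its create/resize subcommands are real, but the image path, sizes, and the orchestration around them are invented here.

    # Sketch of the out-of-disk-space variant.
    import random, subprocess

    DATA_IMG = "/var/lib/crashfarm/pgdata.qcow2"  # hypothetical data-disk image

    # Give the guest a data disk of constrained, random size.
    size_mb = random.randint(256, 1024)
    subprocess.check_call(["qemu-img", "create", "-f", "qcow2",
                           DATA_IMG, "%dM" % size_mb])

    # ... attach the disk, boot the guest, run the workload until out-of-space
    # makes it fail, then shut the guest down (omitted) ...

    # Grow the image, restart, and re-run the checks.  The guest's filesystem
    # would also need growing (e.g. resize2fs) before Pg actually gets space.
    subprocess.check_call(["qemu-img", "resize", DATA_IMG, "+2G"])

    # Putting WAL on a second, separately sized image gives the WAL-only and
    # data-only variants with the same trick.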

I've written some automated KVM test harnesses, so I could have a play with this idea. I'd probably need some help with the test design, though, and the guest OS would be Linux, Linux, or Linux - at least to start with.

Opinions?

--
Craig Ringer
