On 07/18/2012 06:56 AM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> BTW, while we are on the subject: hasn't this split completely broken
>>> the statistics about backend-initiated writes?
>> Yes, it seems to have done just that.
>
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper.  The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable.  So it just queues
> its fsync requests locally, and then never acts on them, since it never
> runs any checkpoints anymore.
>
> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01)

That makes me wonder whether, on top of the buildfarm, we need a "crashfarm": some buildfarm machines extended to do automated crash-recovery testing, along these lines (a sketch of the controller loop follows the list):

- Keep KVM instances with copy-on-write snapshot disks and the build environment on them.
- Fire up the VM, do a build, and start the server.
- From outside the VM, have the test controller connect to the server and start a test run.
- Hard-kill the OS instance at a random point in time.
- Start the OS instance back up.
- Start Pg back up and connect to it again.
- From the test controller, test the Pg install for possible corruption by reading the indexes and tables, doing some test UPDATEs, etc.
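Purely as an illustration of what I have in mind for the controller side, something like the following. Everything in it is hypothetical: the libvirt/KVM guest name "pg-crashtest", the shut-off disk snapshot "clean-build", the guest address, and the workload.sql / checks.sql scripts are all made up; only virsh and psql themselves are real.

    #!/usr/bin/env python
    # Very rough sketch of the crashfarm controller loop.
    import random, subprocess, time

    GUEST_IP = "192.168.122.50"          # made-up guest address

    def virsh(*args):
        subprocess.check_call(["virsh"] + list(args))

    def psql(*args):
        # Talk to the guest's postmaster from the controller host.
        return subprocess.call(["psql", "-h", GUEST_IP, "-U", "postgres"]
                               + list(args))

    for run in range(100):
        virsh("snapshot-revert", "pg-crashtest", "clean-build")  # fresh disks
        virsh("start", "pg-crashtest")
        time.sleep(30)                            # wait for boot + postmaster

        # Kick off a write-heavy workload (hypothetical script).
        load = subprocess.Popen(["psql", "-h", GUEST_IP, "-U", "postgres",
                                 "-f", "workload.sql"])

        time.sleep(random.uniform(5, 120))        # run for a random interval
        virsh("destroy", "pg-crashtest")          # hard power-off, no shutdown
        load.kill()                               # the client has died anyway

        virsh("start", "pg-crashtest")            # power the OS back on
        time.sleep(30)                            # crash recovery happens here

        # Can we connect, and do the consistency checks pass?
        if psql("-c", "SELECT 1") != 0 or psql("-f", "checks.sql") != 0:
            print("run %d: failed recovery or possible corruption" % run)
            break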

The main challenge would be coming up with suitable tests to run - ones whose results could be checked afterwards to make sure nothing was broken. The test controller would know which test it was running and how far it got before the OS was killed, so given appropriate test metadata it could check for the expected data. The enable_* planner settings (enable_seqscan, enable_indexscan, etc.) should make it possible to force scans of both the indexes and the table heaps.
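For the checks themselves, something along these lines is what I'm thinking of. The table, its "id" index, the payload column, and the guest address are invented for illustration; real names and expected values would come from the test metadata:

    # Illustrative consistency check: force both access paths on a test table
    # and make sure the heap and its index still agree.
    import subprocess

    GUEST_IP = "192.168.122.50"    # hypothetical guest address

    def query(sql):
        # Feed SQL to psql on stdin; -q hides command tags, -At gives bare rows.
        return subprocess.check_output(
            ["psql", "-h", GUEST_IP, "-U", "postgres", "-qAt"],
            input=sql, universal_newlines=True).strip()

    # Count rows via a forced sequential scan of the heap ...
    seq = query("SET enable_indexscan = off; SET enable_bitmapscan = off;\n"
                "SELECT count(*) FROM crashtest_table;")

    # ... and again via a forced index scan (assumes an index on id).
    idx = query("SET enable_seqscan = off;\n"
                "SELECT count(*) FROM crashtest_table WHERE id >= 0;")

    if seq != idx:
        raise SystemExit("heap and index disagree: %s vs %s" % (seq, idx))

    # A few test UPDATEs to confirm the cluster is writable after recovery.
    query("UPDATE crashtest_table SET payload = payload WHERE id % 1000 = 0;")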

What else should be checked? The main thing that comes to mind for me is something I've worried about for a while: that Pg might not handle running out of disk space anywhere near as gracefully as is often claimed. There's no automated testing for that, so it's hard to really know. A harnessed VM could be used to test it: instead of virtual plug-pull tests, the harness would generate a virtual disk of constrained random size, run its tests until running out of space caused a failure, stop Pg, expand the disk, restart Pg, and run its checks.

Variants where WAL is on a separate disk, and where only the WAL disk or only the main non-WAL disk runs out of space, would also make sense and would be easy to produce with such a harness.
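As a rough illustration of the disk-full variant: qemu-img and its create/resize subcommands are real, but the image path, sizes, and the orchestration around them are invented here.

    # Sketch of the out-of-disk-space variant.
    import random, subprocess

    DATA_IMG = "/var/lib/crashfarm/pgdata.qcow2"  # hypothetical data-disk image

    # Give the guest a data disk of constrained, random size.
    size_mb = random.randint(256, 1024)
    subprocess.check_call(["qemu-img", "create", "-f", "qcow2",
                           DATA_IMG, "%dM" % size_mb])

    # ... attach the disk, boot the guest, run the workload until out-of-space
    # makes it fail, then shut the guest down (omitted) ...

    # Grow the image, restart, and re-run the checks.  The guest's filesystem
    # would also need growing (e.g. resize2fs) before Pg actually gets space.
    subprocess.check_call(["qemu-img", "resize", DATA_IMG, "+2G"])

    # Putting WAL on a second, separately sized image gives the WAL-only and
    # data-only variants with the same trick.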

I've written some automated KVM test harnesses, so I could have a play with this idea. I'd probably need some help with the test design, though, and the guest OS would be Linux, Linux, or Linux - at least to start with.

Opinions?

--
Craig Ringer
