The basic idea, like before, is to split WAL insertion into two phases:
1. Reserve the right amount of WAL space. This is done while holding just a spinlock. Thanks to the changes I made earlier to the WAL format, the space calculations are now much simpler, and the critical section boils down to little more than "CurBytePos += size_of_wal_record". See the ReserveXLogInsertLocation() function, and the sketch just after this list.
2. Copy the WAL record to the right location in the WAL buffers. This slower part can be done mostly in parallel.
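To illustrate the first phase, here's roughly what the reservation step boils down to. This is a simplified sketch, not the exact code from the patch: "Insert" stands for the shared insertion state, the field names are illustrative, and the real function also has to maintain the prev-link for the record header:

    static void
    ReserveXLogInsertLocation(int size, uint64 *StartBytePos,
                              uint64 *EndBytePos)
    {
        /* The whole critical section is a fetch-and-add on the byte position. */
        SpinLockAcquire(&Insert->insertpos_lck);
        *StartBytePos = Insert->CurBytePos;
        Insert->CurBytePos += size;
        *EndBytePos = Insert->CurBytePos;
        SpinLockRelease(&Insert->insertpos_lck);
    }

Because the reservation is this cheap, the spinlock should only be held for a handful of instructions even under heavy insertion traffic.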
The difficult part is tracking which insertions are currently in progress, and being able to wait for an insertion to finish copying the record data in place. I'm using a small number of WAL insertion slots for that (7 at the moment). The first thing XLogInsert does is grab one of the slots. Each slot is protected by an LWLock, and XLogInsert reserves a slot by acquiring its lock. It holds the lock until it has completely finished copying the WAL record in place. Each slot also contains an XLogRecPtr that indicates how far the current inserter has progressed with its insertion. Typically, for a short record that fits on a single page, it is updated only after the insertion has finished; but if the insertion needs to wait for a WAL buffer to become available, it updates the XLogRecPtr before sleeping.
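In pseudo-C, the slot layout and the insertion path look roughly like this. The names here (XLogInsertSlot, WALInsertSlots, CopyXLogRecordToWAL, and the modulo slot-picking) are shorthand for this sketch and may not match the patch exactly:

    #define NUM_XLOGINSERT_SLOTS 7

    typedef struct
    {
        XLogRecPtr  xlogInsertingAt; /* how far this insertion has progressed */
        LWLockId    lock;            /* held while the insertion is in progress */
    } XLogInsertSlot;

    static XLogInsertSlot *WALInsertSlots;  /* in shared memory */

    /* Called from XLogInsert() after the space has been reserved. */
    static void
    InsertReservedRecord(XLogRecData *rdata, XLogRecPtr StartPos,
                         XLogRecPtr EndPos)
    {
        XLogInsertSlot *slot;

        /* Grab a slot; LWLockAcquire() blocks until it's free. */
        slot = &WALInsertSlots[MyProc->pgprocno % NUM_XLOGINSERT_SLOTS];
        LWLockAcquire(slot->lock, LW_EXCLUSIVE);

        /*
         * Copy the record into the WAL buffers. If this has to sleep
         * waiting for a buffer to become available, it updates
         * slot->xlogInsertingAt to show how far it got before sleeping.
         */
        CopyXLogRecordToWAL(slot, rdata, StartPos, EndPos);

        /* Typically the progress pointer is only updated here, at the end. */
        slot->xlogInsertingAt = EndPos;
        LWLockRelease(slot->lock);
    }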
To wait for all insertions up to a point to finish, you scan all the insertion slots and wait until the XLogRecPtrs in them are all >= the point you're interested in. The number of slots is a tradeoff: more slots allow more concurrency in inserting records, but make it slower to determine how far the WAL can safely be flushed.
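The flush-side wait could then look something like this. Again, this is a hedged sketch with the same illustrative names as above; in real code the unlocked reads of xlogInsertingAt would need a spinlock or memory barriers, and the actual patch may wait in a smarter way:

    /*
     * Wait until no insertion that reserved space before 'upto' is still
     * copying its record into the buffers.
     */
    static void
    WaitXLogInsertionsToFinish(XLogRecPtr upto)
    {
        int     i;

        for (i = 0; i < NUM_XLOGINSERT_SLOTS; i++)
        {
            XLogInsertSlot *slot = &WALInsertSlots[i];

            /*
             * If this slot's inserter hasn't progressed past 'upto' yet,
             * wait for it. Acquiring the slot's lock in shared mode blocks
             * until the inserter releases it; re-check afterwards in case
             * a new inserter grabbed the slot right away. Any insertion
             * that starts after our caller reserved up to 'upto' begins
             * past that point, so the loop terminates.
             */
            while (slot->xlogInsertingAt != InvalidXLogRecPtr &&
                   slot->xlogInsertingAt < upto)
            {
                LWLockAcquire(slot->lock, LW_SHARED);
                LWLockRelease(slot->lock);
            }
        }
    }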
I did some performance testing of this on an 8-core HP ProLiant server, in a VM running under VMware vSphere 5.1. The tests were performed with Greg Smith's pgbench-tools kit, using one of two custom workload scripts:
1. Insert 1000 rows in each transaction. This is exactly the sort of workload where WALInsertLock currently becomes a bottleneck. Without the patch, the test scales very badly: about 420 TPS with a single client, peaking at only 520 TPS with two clients. With the patch, it scales up to about 1200 TPS with 7 clients. I believe the test becomes I/O limited at that point; looking at iostat output while the test is running shows about 200MB/s of writes, and that is roughly what the I/O subsystem of this machine can do, according to a simple test with 'dd ...; sync'. Or perhaps having more insertion slots would allow it to go higher - the patch uses exactly 7 slots at the moment. Full results: http://hlinnaka.iki.fi/xloginsert-scaling/results-1k/
2. Insert only 10 rows in each transaction. This simulates an OLTP workload with fairly small transactions. The patch doesn't make a huge difference with this workload: it performs somewhat worse with 4-16 clients, but somewhat better with > 16 clients. The patch adds some overhead to flushing the WAL, and I believe that's what's causing the slowdown with 4-16 clients. With more clients, the WALInsertLock bottleneck becomes more significant and you start to see a benefit again. Full results: http://hlinnaka.iki.fi/xloginsert-scaling/results-10/
Overall, the results look pretty good. I'm going to take a closer look at the slowdown in the second test. I think it might be fixable with some changes to how WaitInsertionsToFinish() and WALWriteLock work together, although I'm not sure exactly how that ought to work.
Comments, ideas? - Heikki
Attachment: xloginsert-scale-20.patch.gz