As discussed previously regarding WAL re-writes [1], I have
done some investigation in that area, which I would like to share.

Currently we always write WAL in 8KB blocks, which can lead to
a lot of data being re-written for small transactions. Consider
the case where the amount to be written is usually < 4KB:
we still write it in 8KB chunks, which is the major source
of re-writes. I have tried various options to reduce this re-writing
of data.
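
To put a rough number on the re-write effect described above, here is a minimal sketch of a simplified model. It assumes one flush per commit (no group commit) and that every flush rewrites whole 8KB blocks from the start of the block holding the first unflushed byte. The function names and the 512-byte workload are hypothetical and are not taken from the attached patches.

```python
XLOG_BLCKSZ = 8192  # WAL block size assumed in the discussion above


def bytes_written_block_aligned(record_sizes, block_size=XLOG_BLCKSZ):
    """Total bytes written if every flush rewrites whole blocks.

    Each entry in record_sizes is the WAL generated by one transaction.
    Assumption: one flush per commit, no group commit.
    """
    total = 0
    flushed_upto = 0  # end of the WAL that is already on disk
    for size in record_sizes:
        insert_upto = flushed_upto + size
        # The write starts at the beginning of the block holding the first
        # unflushed byte and ends at the end of the block holding the last.
        start = (flushed_upto // block_size) * block_size
        end = -(-insert_upto // block_size) * block_size  # round up
        total += end - start
        flushed_upto = insert_upto
    return total


def bytes_written_exact(record_sizes):
    """Total bytes written if each flush writes only the new WAL."""
    return sum(record_sizes)


# Hypothetical workload: 16 commits of 512 bytes each, filling one block.
commits = [512] * 16
block_aligned = bytes_written_block_aligned(commits)
exact = bytes_written_exact(commits)
print(block_aligned, exact)  # prints "131072 8192"
```

In this model the same 8KB block is written 16 times, so the device sees 16 times more WAL bytes than were generated. Real workloads will differ, since commits are often grouped.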

First, I tried writing WAL in the exact size that the
transaction (or any other caller) has requested from XLogWrite();
the patch for this experiment is attached.  I wrote a small test
patch (calculate_wal_written_by_backend_v1.patch) to calculate the
amount of WAL written by XLogWrite() and found that the actual
WAL writes by PostgreSQL are reduced by half with the patch
(pgbench tpc-b workload with 4 clients), but unfortunately this led
to a significant decrease (more than 50%) in TPS.  Jan Wieck then
found, by observing the OS stats in /sys/block/<devname>/stat, that
this patch reduced the writes but introduced reads.  The probable
theory is that, as the patch does not write on OS block
boundaries, the OS has to read the block to complete the write
operation.  The reason the OS cannot find the corresponding block in
memory is that, while closing the WAL file, we use
POSIX_FADV_DONTNEED if wal_level is less than 'archive'.
So the conclusion from this experiment is that
although we can avoid re-writing WAL data by doing exact writes,
it could lead to a significant reduction in TPS.
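
The read-before-write theory above can be illustrated with a small sketch that lists which OS blocks a write covers only partially. It assumes a 4096-byte OS block size and that the target blocks already exist on disk but are not cached; the function name is hypothetical.

```python
OS_BLOCK_SIZE = 4096  # assumed OS block size, as mentioned in this mail


def partial_blocks(offset, length, block_size=OS_BLOCK_SIZE):
    """Return the OS block numbers that a write covers only partially.

    Without a cached copy, the OS would have to read each of these
    blocks before it can apply the write.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // block_size
    last = (end - 1) // block_size
    partial = set()
    if offset % block_size != 0:
        partial.add(first)  # write starts in the middle of a block
    if end % block_size != 0:
        partial.add(last)   # write ends in the middle of a block
    return sorted(partial)


print(partial_blocks(0, 8192))    # full 8KB write: no partial blocks
print(partial_blocks(600, 600))   # exact-size write inside block 0
print(partial_blocks(4000, 200))  # exact-size write straddling blocks 0 and 1
```

A block-aligned write never touches a partial block, whereas an exact-size write almost always does, which matches the extra reads seen in the OS stats.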

Then I tried writing the WAL in chunks and introduced
a GUC, wal_write_chunk_size, to experiment with different chunk
sizes (patch - write_wal_chunks_v1.patch).  I noticed that
at a chunk size of 4096 bytes (the OS block size), there is
approximately a 35% reduction in WAL writes (at different client
counts for the pgbench read-write workload), both as measured by my
test patch calculate_wal_written_by_backend_v1.patch and as observed
in the OS stats in /sys/block/<devname>/stat.  Although I see a good
amount of reduction in WAL writes, the TPS increase
is only 1~5% for read-write workloads.  In some cases at a lower
client count (4), I have seen an increase of up to 10~15% across
multiple runs, but I didn't find a clear trend suggesting that lower
client counts will always show such a good improvement.  OTOH,
I have not observed any regression with a 4096-byte WAL chunk size
in my tests so far.  One likely theory for why we might not see much
improvement at high client counts is the logic in XLogFlush(),
where we combine the WAL writes from multiple clients; when the
combined size is greater than 4096 bytes, it will write
8K blocks.  For all other chunk sizes (512, 1024, and
2048 bytes), I observed that the smaller the chunk size, the better
the reduction in WAL writes, but the trend for TPS is just the
opposite (the smaller the chunk size, the worse the TPS); the
probable reason is the same as explained in the previous paragraph.
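
The effect of the chunk size can be sketched as simple rounding arithmetic. This is only an illustration, not the logic of write_wal_chunks_v1.patch: it assumes the pending WAL range is rounded outward to chunk boundaries and that the chunk size divides the 8KB block size. The function name and the byte counts are hypothetical.

```python
def chunked_write_span(start, end, chunk_size):
    """Round the byte range [start, end) outward to chunk boundaries.

    Returns (begin, stop), the range that would actually be written.
    """
    begin = (start // chunk_size) * chunk_size
    stop = -(-end // chunk_size) * chunk_size  # round up
    return begin, stop


def chunked_write_size(start, end, chunk_size):
    """Number of bytes written for the pending range [start, end)."""
    begin, stop = chunked_write_span(start, end, chunk_size)
    return stop - begin


# One client flushing 600 bytes at the start of a block.
single = {c: chunked_write_size(0, 600, c) for c in (512, 1024, 2048, 4096, 8192)}

# Several clients combined into one 5000-byte flush.
combined = {c: chunked_write_size(0, 5000, c) for c in (4096, 8192)}

print(single)
print(combined)
```

For the single small flush, a 4096-byte chunk halves the write compared with 8192 bytes. Once the combined flush exceeds 4096 bytes, both settings write the full 8192 bytes, which is consistent with the smaller gains at high client counts.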


Note -
1. The OS-level reduction in WAL writes was measured with WAL
and data on separate disks.
2. I can share the detailed performance data if required, but
I thought it better to first share the approach of the patches.
3. The patches are at a proof-of-concept stage rather than a real
implementation, but I think it won't take too much effort to
improve them if we find any particular approach acceptable.
[1] -

With Regards,
Amit Kapila.

Attachment: avoid_extra_walwrites_v1.patch

Attachment: calculate_wal_written_by_backend_v1.patch

Attachment: write_wal_chunks_v1.patch

Sent via pgsql-hackers mailing list