Following up on the earlier discussion about WAL re-writes [1], I have done some investigation in that area which I would like to share.
Currently we always write WAL in 8KB blocks, which can lead to a lot of re-writing of data for small transactions. Consider the case where the amount to be written is usually < 4KB: we still write it in 8KB chunks, and that is the major source of re-writes. I have tried various options to reduce this re-writing of data.

First, I tried writing the WAL in exactly the size that the transaction (or other caller) has requested from XLogWrite(); the patch for this experiment is attached. I have written a small test patch (calculate_wal_written_by_backend_v1.patch) to calculate the amount of WAL written by XLogWrite(), and found that with the patch applied the actual WAL writes by PostgreSQL are reduced by half (pgbench tpc-b workload with 4 clients). Unfortunately, this also led to a significant decrease (more than 50%) in TPS.

Jan Wieck then found, by observing the OS stats in /sys/block/<devname>/stat, that this patch reduced the writes but introduced reads. The probable theory is that, because the patch does not write on OS block boundaries, the OS has to read the block in order to complete the write. The reason the OS cannot find the corresponding block in memory is that, while closing the WAL file, we use POSIX_FADV_DONTNEED if wal_level is less than 'archive'. So the conclusion from this experiment is that although exact-size writes avoid re-writing WAL data, they can lead to a significant reduction in TPS.

Then I tried writing the WAL in chunks, and introduced a GUC, wal_write_chunk_size, to experiment with different chunk sizes (patch: write_wal_chunks_v1.patch). At a chunk size of 4096 bytes (the OS block size), there is approximately a 35% reduction in WAL writes at different client counts for the pgbench read-write workload, as measured both with my test patch calculate_wal_written_by_backend_v1.patch and by observing the OS stats in /sys/block/<devname>/stat.
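To make the re-write effect concrete, here is a rough model in C. It is not taken from any of the attached patches: the function names (wal_round_down, wal_round_up, aligned_write_size, total_written) are hypothetical, and it assumes that every flush writes from the chunk boundary at or below the previous write position up to the chunk boundary at or above the new position. It ignores XLogFlush() combining writes from several clients.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: round a WAL position to a multiple of chunk.
 * chunk must be a power of two (e.g. 512, 4096, 8192). */
uint64_t wal_round_down(uint64_t pos, uint64_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    return pos & ~(chunk - 1);
}

uint64_t wal_round_up(uint64_t pos, uint64_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
    return (pos + chunk - 1) & ~(chunk - 1);
}

/* Assumed model: bytes physically written when WAL in [from, to) is
 * flushed and the write is aligned to chunk boundaries on both ends. */
uint64_t aligned_write_size(uint64_t from, uint64_t to, uint64_t chunk)
{
    if (to <= from)
        return 0;
    return wal_round_up(to, chunk) - wal_round_down(from, chunk);
}

/* Total bytes written when nflushes flushes each append rec bytes of WAL,
 * starting from position 0. Bytes that were already written by an earlier
 * flush and fall in the same chunk are counted again (the "re-write"). */
uint64_t total_written(uint64_t rec, int nflushes, uint64_t chunk)
{
    uint64_t pos = 0;
    uint64_t total = 0;

    for (int i = 0; i < nflushes; i++)
    {
        total += aligned_write_size(pos, pos + rec, chunk);
        pos += rec;
    }
    return total;
}
```

Under this model, eight flushes of 1000 bytes each produce 8000 bytes of WAL; total_written(1000, 8, 8192) gives 65536 bytes written, while total_written(1000, 8, 4096) gives 36864. These numbers only illustrate the mechanism and are not the measured results reported here.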
With the 4096-byte chunk size I see a good reduction in WAL writes, but the TPS increase is only between 1~5% for read-write workloads. In some cases at a lower client count (4) I have seen increases of up to 10~15% across multiple runs, but I didn't find a clear trend suggesting that lower client counts will always show such an improvement. OTOH, I have not observed any regression with the 4096-byte WAL chunk size in my tests so far. One likely theory for why we don't see much improvement at high client counts is the logic in XLogFlush(), where we combine the WAL writes from multiple clients; the combined size is then greater than 4096 bytes, in which case it will write 8K blocks.

For the other chunk sizes (512, 1024 and 2048 bytes), I observed that the smaller the chunk size, the better the reduction in WAL writes, but the trend for TPS is just the opposite (the smaller the chunk size, the worse the TPS). The probable reason is the same as explained above for exact-size writes: writes that are not aligned to OS block boundaries make the OS read the block first.

Thoughts?

Note -
1. The OS-level reduction in WAL writes was measured by keeping WAL and data on separate disks.
2. I can share the detailed performance data if required, but I thought it better to first share the approach of the patches.
3. The patches are at a proof-of-concept stage rather than a real implementation, but I think it won't take too much effort to improve them if we find any particular approach acceptable.

[1] - http://www.postgresql.org/message-id/CA+TgmobWdBcbuipWPsbHSbf+-KDmatnYQYZ=akaju6alb5m...@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
avoid_extra_walwrites_v1.patch
calculate_wal_written_by_backend_v1.patch
write_wal_chunks_v1.patch