Hi, I'd like to propose that we increase the default WAL segment size, which is currently 16MB. It was first set to that value in commit 47937403676d913c0e740eec6b85113865c6c8ab in October of 1999; prior to that, it was 64MB. Between 1999 and now, there have been three significant changes that make me think it might be time to rethink this value:
1. Transaction rates are vastly higher these days. In 1999, I think we were still limited to ~2^32 transactions during the entire lifetime of the server; transaction ID wraparound hadn't been invented yet.[1] Today, some installations do that many write transactions in under a week. The practical consequence of this is that WAL files fill up in extremely short periods of time. Some users generate multiple terabytes of WAL per day, which means they are generating - and very likely archiving - WAL files a rate of greater than 1 per second! That poses multiple problems. For example, if your archive command happens to involve ssh, you might run into trouble because of this sort of thing: [rhaas pgsql]$ /usr/bin/time ssh hydra true 1.57 real 0.00 user 0.00 sys Also, your operating system's implementation of directories and the commands to work with them (like ls) don't necessarily scale well to tens or hundreds of thousands of archived files. Furthermore, there is an enforced, synchronous fsync at the end of every segment, which actually does hurt performance on write-heavy workloads.[2] Of course, if that were the only reason to consider increasing the segment size, it would probably make more sense to just try to push that extra fsync into the background, but that's not really the case. From what I hear, the gigantic number of files is a bigger pain point. 2. Disks are a bit larger these days. In the worst case, we waste just under twice as much space as whatever the segment size is: you might need 1 byte from the oldest segment you're keeping and 1 byte from the newest segment that you are keeping, but not the remaining contents of either file. In 1999, trying to limit disk wastage to <32MB probably seemed reasonable, but today that's very little disk space. I think at that time typical hard drive sizes were around 10 GB, whereas today they are around 1 TB.[3] I'm not sure whether the size of the sorts of high-performance storage that is likely to be used for pg_xlog has grown as fast as hard drives generally, but even so it seems pretty clear to me that trying to limit disk wastage to 32MB is excessively conservative on modern hardware. 3. archive_timeout is no longer a frequently used option. Obviously, if you are frequently archiving partial segments, you don't want the segment size to be too large, because if it is, each forced segment switch potentially wastes a large amount of space (and bandwidth). But given streaming replication and pg_receivexlog, the use case for archiving partial segments is, at least according to my understanding, a lot narrower than it used to be. So, I think we don't have to worry as much about keeping forced segment switches cheap as we did during the 8.x series. Considering those three factors, I think we should consider pushing the default value up somewhat higher for v10. Reverting to the 64MB size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab sounds pretty reasonable. Users with really high transaction rates might even prefer a higher value (e.g. 256MB, 1GB) but that's hardly practical for small installs given our default of max_wal_size = 1GB. Possibly it would make sense for this to be configurable at initdb time instead of requiring a recompile; we probably don't save any significant number of cycles by compiling this into the server. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] I believe at that time we consumed an XID even for a read-only transaction, too; today, we can do 2^32 read transactions in a few hours. [2] Amit did some benchmarking on this, I believe, but I don't have the numbers handy. [3] https://commons.wikimedia.org/wiki/File:Hard_drive_capacity_over_time.png -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers