On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
>
> >>In the discussion in the wal section, I'm not sure about the effect of
> >>setting writebacks on SSD, [...]
> >
> >Yea, that paragraph needs some editing. I think we should basically
> >remove that last sentence.
>
> Ok, fine with me. Does that mean that flushing has a significant positive
> impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with an
internal block size bigger than 8k, you'd even see larger differences.

> >>Maybe the merging strategy could be more aggressive than just strict
> >>neighbors?
> >
> >I don't think so. If you flush more than neighbouring writes you'll
> >often end up flushing buffers dirtied by another backend, causing
> >additional stalls.
>
> Ok. Maybe the neighbor definition could be relaxed just a little bit so
> that small holes are overtaken, but not large holes? If there are only a few
> pages in between, even if written by another process, then writing them
> together should be better? Well, this can wait for a clear case, because
> hopefully the OS will recoalesce them behind anyway.

I'm against doing so without clear measurements of a benefit.

> >Also because the infrastructure is used for more than checkpoint
> >writes. There are absolutely no ordering guarantees there.
>
> Yep, but not much benefit to expect from a few dozen random pages either.

Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?

> >>[...] I do think that this whole writeback logic really does make
> >>sense *per table space*,
> >
> >Leads to less regular IO, because if your tablespaces are evenly sized
> >(somewhat common) you'll sometimes end up issuing sync_file_range's
> >shortly after each other. For latency outside checkpoints it's
> >important to control the total amount of dirty buffers, and that's
> >obviously independent of tablespaces.
>
> I do not understand/buy this argument.
>
> The underlying IO queue is per device, and table spaces should be per device
> as well (otherwise what's the point?), so you should want to coalesce and
> "writeback" pages per device as well. Calling sync_file_range on distinct
> devices should probably be issued more or less randomly, and should not
> interfere one with the other.

The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device, especially if SANs and such are involved, where you
don't even know which partitions are on which disks.

> If you use just one context, the more table spaces the less performance
> gains, because there is less and less aggregation thus sequential writes per
> device.
>
> So for me there should really be one context per tablespace. That would
> suggest a hashtable or some other structure to keep and retrieve them, which
> would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.

> >>For the checkpointer, a key aspect is that the scheduling process goes
> >>to sleep from time to time, and this sleep time looked like a great
> >>opportunity to do this kind of flushing. You choose not to take advantage
> >>of the behavior, why?
> >
> >Several reasons: Most importantly there's absolutely no guarantee that
> >you'll ever end up sleeping, it's quite common to happen only seldom.
>
> Well, that would be under a situation when pg is completely unresponsive.
> More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly.
It'll just sometimes block in sync_file_range(), but the IO queues will
have enough space to frequently give way to other backends, particularly
to synchronous reads (most pg reads) and synchronous writes
(fdatasync()). So a single checkpoint will take a bit longer, but
otherwise the system will mostly keep up the work in a regular manner.
Without the sync_file_range() calls the kernel will amass dirty buffers
until global dirty limits are reached, which then will bring the whole
system to a standstill.

It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out; in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.

> >I also don't really believe it helps that much, although that's a complex
> >argument to make.
>
> Yep. My thinking is that doing things in the sleeping interval does not
> interfere with the checkpointer scheduling, so it is less likely to go wrong
> and fall behind.

I don't really see why that's the case. Triggering writeback every N
writes doesn't really influence the scheduling in a bad way - the
flushing is done *before* computing the sleep time. Triggering the
writeback *after* computing the sleep time, and then sleeping for that
long, in addition to the time for sync_file_range, skews things more.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
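
For illustration, the "sort, then merge only strict neighbours" behaviour
discussed in this thread can be sketched roughly as below. This is a
minimal sketch, not the code from the patch: the PendingWrite and
FlushRange types and the coalesce_pending() function are hypothetical
names made up for this example, and it only computes the ranges rather
than issuing any IO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical: one pending write, identified by a file and a block number. */
typedef struct PendingWrite
{
    int  file_id;
    long block;
} PendingWrite;

/* Hypothetical: one contiguous run of blocks to hand to the kernel at once. */
typedef struct FlushRange
{
    int  file_id;
    long first_block;
    long nblocks;
} FlushRange;

/* Order by file first, then by block number within the file. */
static int
cmp_pending(const void *a, const void *b)
{
    const PendingWrite *x = a;
    const PendingWrite *y = b;

    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/*
 * Sort the pending writes and merge strictly adjacent blocks of the same
 * file into ranges. Holes are never bridged, so blocks that were not
 * queued are never included. "out" must have room for n entries.
 * Returns the number of ranges written to "out".
 */
size_t
coalesce_pending(PendingWrite *pending, size_t n, FlushRange *out)
{
    size_t nranges = 0;

    if (n == 0)
        return 0;
    assert(pending != NULL && out != NULL);

    qsort(pending, n, sizeof(PendingWrite), cmp_pending);

    for (size_t i = 0; i < n; i++)
    {
        if (nranges > 0 && out[nranges - 1].file_id == pending[i].file_id)
        {
            FlushRange *last = &out[nranges - 1];
            long        next = last->first_block + last->nblocks;

            if (pending[i].block < next)
                continue;           /* duplicate of a block already covered */
            if (pending[i].block == next)
            {
                last->nblocks++;    /* strict neighbour: extend the range */
                continue;
            }
        }

        /* Different file, or a hole: start a new range. */
        out[nranges].file_id = pending[i].file_id;
        out[nranges].first_block = pending[i].block;
        out[nranges].nblocks = 1;
        nranges++;
    }
    return nranges;
}
```

Under these assumptions, each resulting range would correspond to one
writeback request, for example a single Linux sync_file_range() call
covering nblocks pages starting at first_block. Calling
coalesce_pending() every N queued writes, before the sleep time is
computed, matches the ordering argued for above.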