On Fri, Mar 29, 2024 at 7:01 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> On 3/28/24 06:20, Thomas Munro wrote:
> > With the unexplained but apparently somewhat systematic regression
> > patterns on certain tests and settings, I wonder if they might be due
> > to read_stream.c trying to form larger reads, making it a bit lazier.
> > It tries to see what the next block will be before issuing the
> > fadvise.  I think that means that with small I/O concurrency settings,
> > there might be contrived access patterns where it loses, and needs
> > effective_io_concurrency to be set one notch higher to keep up, or
> > something like that.
>
> Yes, I think we've speculated this might be the root cause before, but
> IIRC we didn't manage to verify it actually is the problem.
Another factor could be the bug in master that allows it to get out of
sync -- can it allow *more* concurrency than it intended to?  Or fewer
hints, but somehow that goes faster because of the
stepping-on-kernel-toes problem?

> > One way to test that idea would be to run the
> > tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> > eagerly when io_combine_limit is reached, so I suppose it should be
> > exactly as eager as master.  The only difference then should be that
> > it automatically suppresses sequential fadvise calls.
>
> Sure, I'll give that a try. What are some good values to test? Perhaps
> 32 and 1, i.e. the default and "no coalescing"?

Thanks!  Yeah.  The default is actually 16, computed backwards from
128kB.  (Explanation: POSIX requires at least 16 for IOV_MAX, i.e. the
number of vectors acceptable to writev/readv and related functions,
though the actual acceptable number is usually much higher.  It also
seems to be a conservative number for hardware scatter/gather lists in
various protocols, i.e. if doing direct I/O, the transfer won't be
chopped up into more than one physical I/O command, because the disk
and DMA engine can handle it as a single I/O, in theory at least.  The
actual limit on random SSDs might be more like 33, a weird number but
that's what I'm seeing; Mr Axboe wrote a nice short article[1] that
gives some starting points for terminology on that topic on Linux.
Also, just anecdotally, returns seem to diminish after that with huge
transfers of buffered I/O, so it seems like an OK number if you have to
pick one; but IDK, YMMV, subject for future research as direct I/O
grows in relevance, hence the GUC.)

> If this turns out to be the problem, does that mean we would consider
> using a more conservative default value? Is there some "auto tuning" we
> could do? For example, could we reduce the value combine limit if we
> start not finding buffers in memory, or something like that?

Hmm, not sure...  I like that number for seq scans.  I also like
auto-tuning.  But it seems to me that *if* the problem is that we're
not allowing ourselves as many concurrent I/Os as master BHS because
we're waiting to see if the next block is consecutive, that might
indicate that the distance needs to be higher, so that we have a better
chance to see the 'edge' (the non-contiguous next block) and start the
I/O, not that the io_combine_limit needs to be lower.  But I could be
way off, given the fuzziness on this problem so far...

> Anyway, doesn't the combine limit work against the idea that
> effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
> we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
> then process 32 pages before we actually need X (and we expect the page
> to already be in memory, thanks to the gap). But with the combine limit
> set to 32, is this still true?

Hmm.  It's different.  (1) Master BHS has prefetch_maximum, which is
indeed taken directly from the eic setting, while read_stream.c is
prepared to look much further ahead than that (potentially as far as
max_pinned_buffers) if it has been useful recently, to find
opportunities to coalesce and start I/O.  (2) Master BHS has
prefetch_target to control the look-ahead window, which starts small
and ramps up until it hits prefetch_maximum, while read_stream.c has
distance, which goes up and down according to a more complex algorithm
described at the top of the file.
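To illustrate (2), here's a tiny simulation of the master-side ramp-up.
I'm paraphrasing the stepping rule in nodeBitmapHeapscan.c from memory,
and the scaffolding around it is made up, so treat this as an
approximation rather than the real executor code:

    #include <stdio.h>

    /*
     * Toy simulation of master BHS's look-ahead window.  The ramp-up
     * rule is paraphrased from memory; not the real code.
     */
    int
    main(void)
    {
        int     prefetch_maximum = 32;  /* effective_io_concurrency */
        int     prefetch_target = 0;    /* starts small... */

        for (int page = 0; page < 10; page++)
        {
            /* ...and ramps up monotonically until it hits the maximum */
            if (prefetch_target >= prefetch_maximum)
                ;               /* stay at the cap, never shrink */
            else if (prefetch_target >= prefetch_maximum / 2)
                prefetch_target = prefetch_maximum;
            else if (prefetch_target > 0)
                prefetch_target *= 2;
            else
                prefetch_target++;

            printf("page %d: prefetch_target = %d\n",
                   page, prefetch_target);
        }
        return 0;
    }

That prints 1, 2, 4, 8, 16, 32, 32, ... and never comes back down,
whereas distance in read_stream.c can grow past eic (towards
max_pinned_buffers) and also shrink again when the look-ahead stops
finding I/Os to start.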
> I've tried going through read_stream_* to determine how this will
> behave, but read_stream_look_ahead/read_stream_start_pending_read does
> not make this very clear. I'll have to experiment with some tracing.

I'm going to try to set up something like your experiment here too, and
figure out some way to visualise or trace what's going on...

The differences come from (1) multi-block I/Os, requiring two separate
numbers: how many blocks ahead we're looking, and how many I/Os are
running, and (2) being more aggressive about trying to reach the
desired I/O level.  Let me try to describe the approach again.

"distance" is the size of a window that we're searching for
opportunities to start I/Os.  read_stream_look_ahead() will keep
looking ahead until we already have max_ios I/Os running, or we hit the
end of that window.  That's the two conditions in the while loop at the
top:

    while (stream->ios_in_progress < stream->max_ios &&
           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)

If that window is not large enough, we won't be able to find enough
I/Os to reach max_ios.  So, every time we finish starting a random
(non-sequential) I/O, we increase the distance, widening the window
until it can hopefully reach the I/O goal.  (I call that behaviour C in
the comments and code.)  In other words, master BHS can only find
opportunities to start I/Os in a smaller window, and can only reach the
full I/O concurrency target if they are right next to each other in
that window, while read_stream.c will look much further ahead, but only
if that has recently proven to be useful.

If we find I/Os that need doing, but they're all sequential, the window
size moves towards io_combine_limit, because we know that issuing
advice won't help, so there is no point in making the window wider than
one maximum-sized I/O.  For example, sequential scans or bitmap
heapscans with lots of consecutive page bits fall into this pattern.
(Behaviour B in the code comments.)  Master BHS has nothing like this.

If we find that we don't need to do any I/O, we slowly move the window
size towards 1 (also the initial value), since there is no point in
doing anything special when it can't help.  In contrast, master BHS
never shrinks its prefetch_target; it only goes up until it hits eic.

[1] https://kernel.dk/when-2mb-turns-into-512k.pdf
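PS: To make behaviours A/B/C easier to see at a glance, here is a toy
model of the distance adjustments described above.  It is deliberately
simplified -- the constants and the doubling step are made up for
illustration, and this is not the real read_stream.c logic:

    #include <stdio.h>

    /* Toy model of the distance heuristic described above; not the real code. */
    enum lookahead_result { HIT, SEQUENTIAL_IO, RANDOM_IO };

    static void
    adjust_distance(int *distance, int io_combine_limit,
                    int max_pinned_buffers, enum lookahead_result what)
    {
        switch (what)
        {
            case HIT:
                /* A: no I/O needed, decay towards 1 (the initial value) */
                if (*distance > 1)
                    (*distance)--;
                break;
            case SEQUENTIAL_IO:
                /* B: advice won't help, move towards one full-sized I/O */
                if (*distance > io_combine_limit)
                    (*distance)--;
                else if (*distance < io_combine_limit)
                    (*distance)++;
                break;
            case RANDOM_IO:
                /* C: widen the window so we can hopefully reach max_ios */
                *distance *= 2;
                if (*distance > max_pinned_buffers)
                    *distance = max_pinned_buffers;
                break;
        }
    }

    int
    main(void)
    {
        int     distance = 1;

        /* A run of random cache misses widens the window quickly... */
        for (int i = 0; i < 5; i++)
        {
            adjust_distance(&distance, 16, 256, RANDOM_IO);
            printf("after random I/O %d: distance = %d\n", i + 1, distance);
        }
        /* ...and a run of cache hits lets it fall back again. */
        for (int i = 0; i < 5; i++)
        {
            adjust_distance(&distance, 16, 256, HIT);
            printf("after hit %d: distance = %d\n", i + 1, distance);
        }
        return 0;
    }

A long enough run of hits lets distance fall all the way back to 1,
while a run of sequential misses would park it near io_combine_limit
instead.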