On Fri, Mar 29, 2024 at 7:01 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> On 3/28/24 06:20, Thomas Munro wrote:
> > With the unexplained but apparently somewhat systematic regression
> > patterns on certain tests and settings, I wonder if they might be due
> > to read_stream.c trying to form larger reads, making it a bit lazier.
> > It tries to see what the next block will be before issuing the
> > fadvise.  I think that means that with small I/O concurrency settings,
> > there might be contrived access patterns where it loses, and needs
> > effective_io_concurrency to be set one notch higher to keep up, or
> > something like that.
>
> Yes, I think we've speculated this might be the root cause before, but
> IIRC we didn't manage to verify it actually is the problem.
Another factor could be the bug in master that allows it to get out of
sync -- can it allow *more* concurrency than it intended to?  Or fewer
hints, but somehow that goes faster because of the
stepping-on-kernel-toes problem?

> > One way to test that idea would be to run the
> > tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> > eagerly when io_combine_limit is reached, so I suppose it should be
> > exactly as eager as master.  The only difference then should be that
> > it automatically suppresses sequential fadvise calls.
>
> Sure, I'll give that a try. What are some good values to test? Perhaps
> 32 and 1, i.e. the default and "no coalescing"?

Thanks!  Yeah.  The default is actually 16, computed backwards from
128kB.  (Explanation: POSIX requires at least 16 for IOV_MAX, i.e. the
number of vectors acceptable to writev/readv and related functions,
though the actual acceptable number is usually much higher.  It also
seems to be a conservative number for hardware scatter/gather lists in
various protocols, i.e. if doing direct I/O, the transfer won't be
chopped up into more than one physical I/O command, because the disk
and DMA engine can handle it as a single I/O, in theory at least.  The
actual limit on random SSDs might be more like 33, a weird number but
that's what I'm seeing; Mr Axboe wrote a nice short article[1] that
gives some starting points for terminology on that topic on Linux.
Also, just anecdotally, returns seem to diminish after that with huge
transfers of buffered I/O, so it seems like an OK number if you have to
pick one; but IDK, YMMV, subject for future research as direct I/O
grows in relevance, hence the GUC.)

> If this turns out to be the problem, does that mean we would consider
> using a more conservative default value? Is there some "auto tuning" we
> could do? For example, could we reduce the value combine limit if we
> start not finding buffers in memory, or something like that?

Hmm, not sure...  I like that number for seq scans.  I also like
auto-tuning.  But it seems to me that *if* the problem is that we're
not allowing ourselves as many concurrent I/Os as master BHS because
we're waiting to see if the next block is consecutive, that might
indicate that the distance needs to be higher, so that we have a better
chance to see the 'edge' (the non-contiguous next block) and start the
I/O, not that the io_combine_limit needs to be lower.  But I could be
way off, given the fuzziness on this problem so far...

> Anyway, doesn't the combine limit work against the idea that
> effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
> we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
> then process 32 pages before we actually need X (and we expect the page
> to already be in memory, thanks to the gap). But with the combine limit
> set to 32, is this still true?

Hmm.  It's different.  (1) Master BHS has prefetch_maximum, which is
indeed taken directly from the eic setting, while read_stream.c is
prepared to look much further ahead than that (potentially as far as
max_pinned_buffers) if it has been useful recently, to find
opportunities to coalesce and start I/O.  (2) Master BHS has
prefetch_target to control the look-ahead window, which starts small
and ramps up until it hits prefetch_maximum, while read_stream.c has
distance, which goes up and down according to a more complex algorithm
described at the top of the file.
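To illustrate (2), here's a tiny simulation of the master-side ramp-up.
I'm paraphrasing the stepping rule in nodeBitmapHeapscan.c from memory,
and the scaffolding around it is made up, so treat this as an
approximation rather than the real executor code:

    #include <stdio.h>

    /*
     * Toy simulation of master BHS's look-ahead window.  The ramp-up
     * rule is paraphrased from memory; not the real code.
     */
    int
    main(void)
    {
        int     prefetch_maximum = 32;  /* effective_io_concurrency */
        int     prefetch_target = 0;    /* starts small... */

        for (int page = 0; page < 10; page++)
        {
            /* ...and ramps up monotonically until it hits the maximum */
            if (prefetch_target >= prefetch_maximum)
                ;               /* stay at the cap, never shrink */
            else if (prefetch_target >= prefetch_maximum / 2)
                prefetch_target = prefetch_maximum;
            else if (prefetch_target > 0)
                prefetch_target *= 2;
            else
                prefetch_target++;

            printf("page %d: prefetch_target = %d\n",
                   page, prefetch_target);
        }
        return 0;
    }

That prints 1, 2, 4, 8, 16, 32, 32, ... and never comes back down,
whereas distance in read_stream.c can grow past eic (towards
max_pinned_buffers) and also shrink again when the look-ahead stops
finding I/Os to start.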
> I've tried going through read_stream_* to determine how this will
> behave, but read_stream_look_ahead/read_stream_start_pending_read does
> not make this very clear. I'll have to experiment with some tracing.

I'm going to try to set up something like your experiment here too, and
figure out some way to visualise or trace what's going on...

The differences come from (1) multi-block I/Os, requiring two separate
numbers: how many blocks ahead we're looking, and how many I/Os are
running, and (2) being more aggressive about trying to reach the
desired I/O level.  Let me try to describe the approach again.

"distance" is the size of a window that we're searching for
opportunities to start I/Os.  read_stream_look_ahead() will keep
looking ahead until we already have max_ios I/Os running, or we hit the
end of that window.  That's the two conditions in the while loop at the
top:

    while (stream->ios_in_progress < stream->max_ios &&
           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)

If that window is not large enough, we won't be able to find enough
I/Os to reach max_ios.  So, every time we finish starting a random
(non-sequential) I/O, we increase the distance, widening the window
until it can hopefully reach the I/O goal.  (I call that behaviour C in
the comments and code.)  In other words, master BHS can only find
opportunities to start I/Os in a smaller window, and can only reach the
full I/O concurrency target if they are right next to each other in
that window, while read_stream.c will look much further ahead, but only
if that has recently proven to be useful.

If we find I/Os that need doing, but they're all sequential, the window
size moves towards io_combine_limit, because we know that issuing
advice won't help, so there is no point in making the window wider than
one maximum-sized I/O.  For example, sequential scans or bitmap
heapscans with lots of consecutive page bits fall into this pattern.
(Behaviour B in the code comments.)  Master BHS has nothing like this.

If we find that we don't need to do any I/O, we slowly move the window
size towards 1 (also the initial value), since there is no point in
doing anything special when it can't help.  In contrast, master BHS
never shrinks its prefetch_target; it only goes up until it hits eic.

[1] https://kernel.dk/when-2mb-turns-into-512k.pdf
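PS: To make behaviours A/B/C easier to see at a glance, here is a toy
model of the distance adjustments described above.  It is deliberately
simplified -- the constants and the doubling step are made up for
illustration, and this is not the real read_stream.c logic:

    #include <stdio.h>

    /* Toy model of the distance heuristic described above; not the real code. */
    enum lookahead_result { HIT, SEQUENTIAL_IO, RANDOM_IO };

    static void
    adjust_distance(int *distance, int io_combine_limit,
                    int max_pinned_buffers, enum lookahead_result what)
    {
        switch (what)
        {
            case HIT:
                /* A: no I/O needed, decay towards 1 (the initial value) */
                if (*distance > 1)
                    (*distance)--;
                break;
            case SEQUENTIAL_IO:
                /* B: advice won't help, move towards one full-sized I/O */
                if (*distance > io_combine_limit)
                    (*distance)--;
                else if (*distance < io_combine_limit)
                    (*distance)++;
                break;
            case RANDOM_IO:
                /* C: widen the window so we can hopefully reach max_ios */
                *distance *= 2;
                if (*distance > max_pinned_buffers)
                    *distance = max_pinned_buffers;
                break;
        }
    }

    int
    main(void)
    {
        int     distance = 1;

        /* A run of random cache misses widens the window quickly... */
        for (int i = 0; i < 5; i++)
        {
            adjust_distance(&distance, 16, 256, RANDOM_IO);
            printf("after random I/O %d: distance = %d\n", i + 1, distance);
        }
        /* ...and a run of cache hits lets it fall back again. */
        for (int i = 0; i < 5; i++)
        {
            adjust_distance(&distance, 16, 256, HIT);
            printf("after hit %d: distance = %d\n", i + 1, distance);
        }
        return 0;
    }

A long enough run of hits lets distance fall all the way back to 1,
while a run of sequential misses would park it near io_combine_limit
instead.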