On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Is int16 enough though? It seems so, because:
>
>      max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
>
> and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
> buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.
>
> If someone changes those constants though, int16 might overflow and fail
> in weird ways. I'd suggest being more careful here and explicitly clamp
> max_pinned_buffers at PG_INT16_MAX or have a static assertion or
> something. (I think it needs to be somewhat less than PG_INT16_MAX,
> because of the extra "overflow buffers" stuff and some other places
> where you do arithmetic.)

Clamp added.
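
For the archives, the clamp amounts to something like this.  It is
only a sketch with assumed names and margin (the quoted formula plus
some headroom below INT16_MAX); the real thing is in the patch.

#include <stdint.h>

static inline int
clamp_max_pinned_buffers(int max_ios, int buffer_io_size)
{
    int     max_pinned_buffers;

    /* Scale with I/O depth, but never below one full combined I/O. */
    max_pinned_buffers = max_ios * 4;
    if (max_pinned_buffers < buffer_io_size)
        max_pinned_buffers = buffer_io_size;

    /*
     * Stay well clear of INT16_MAX so the int16 queue indexes and the
     * extra "overflow buffers" arithmetic can't wrap.
     */
    if (max_pinned_buffers > INT16_MAX - buffer_io_size - 1)
        max_pinned_buffers = INT16_MAX - buffer_io_size - 1;

    return max_pinned_buffers;
}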

> >       /*
> >        * We gave a contiguous range of buffer space to StartReadBuffers(), but
> >        * we want it to wrap around at max_pinned_buffers.  Move values that
> >        * overflowed into the extra space.  At the same time, put -1 in the I/O
> >        * slots for the rest of the buffers to indicate no I/O.  They are covered
> >        * by the head buffer's I/O, if there is one.  We avoid a % operator.
> >        */
> >       overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> >       if (overflow > 0)
> >       {
> >               memmove(&stream->buffers[0],
> >                               &stream->buffers[stream->max_pinned_buffers],
> >                               sizeof(stream->buffers[0]) * overflow);
> >               for (int i = 0; i < overflow; ++i)
> >                       stream->buffer_io_indexes[i] = -1;
> >               for (int i = 1; i < nblocks - overflow; ++i)
> >                       stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >       }
> >       else
> >       {
> >               for (int i = 1; i < nblocks; ++i)
> >                       stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >       }
>
> Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
> to initialize the array to -1 in streaming_read_buffer_begin(), and
> reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
> after the WaitReadBuffers() call. In other words, except when an I/O is
> in progress, keep all the elements at -1, even the elements that are not
> currently in use.

Yeah, that wasn't nice, and I had already got as far as doing exactly
that ↑ on my own, but your second idea ↓ is better!

> Alternatively, you could remember the first buffer that the I/O applies
> to in the 'ios' array. In other words, instead of pointing from buffer
> to the I/O that it depends on, point from the I/O to the buffer that
> depends on it. The last attached patch implements that approach. I'm not
> wedded to it, but it feels a little simpler.

Yeah, nice improvement.
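
(For anyone skimming the archives later: the reversed mapping boils
down to each in-flight I/O remembering the first queue slot it covers,
roughly like the sketch below.  The names and the pointer indirection
are illustrative, not necessarily what the patch uses.)

#include <stdint.h>

typedef struct ReadBuffersOperation ReadBuffersOperation;  /* opaque here */

typedef struct InProgressIO
{
    int16_t     buffer_index;   /* first buffer covered by this I/O */
    ReadBuffersOperation *op;   /* the vectored read itself */
} InProgressIO;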

> >               if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> >               {
> >                       /* Distance ramps up fast (behavior C). */
> >                       ...
> >               }
> >               else
> >               {
> >                       /* No advice; move towards full I/O size (behavior B). */
> >                       ...
> >               }
>
> The comment on ReadBuffersOperation says "Declared in public header only
> to allow inclusion in other structs, but contents should not be
> accessed", but here you access the 'flags' field.
>
> You also mentioned that the StartReadBuffers() argument list is too
> long. Perhaps the solution is to redefine ReadBuffersOperation so that
> it consists of two parts: 1st part is filled in by the caller, and
> contains the arguments, and 2nd part is private to bufmgr.c. The
> signature for StartReadBuffers() would then be just:
>
> bool StartReadBuffers(ReadBuffersOperation *operation);

Yeah.  I had already got as far as doing this on the
regression-hunting expedition, but I kept separate arguments for the
frequently changing things, e.g. blocknum.  That way the stuff that
never changes lives in the struct, and the stuff that changes on each
call doesn't have to be written to memory at all.

> That would make it OK to read the 'flags' field. It would also allow
> reusing the same ReadBuffersOperation struct for multiple I/Os for the
> same relation; you only need to change the changing parts of the struct
> on each operation.

Right.  Done.
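
Concretely, the interface now has roughly this shape.  The member
names and the exact split between struct and argument list below are
illustrative, the patch is authoritative.  Things that rarely change
are set up once in the struct (which also makes it legitimate for the
streaming read code to look at the flags), while fast-changing values
stay in the argument list.

typedef struct ReadBuffersOperation
{
    /* Filled in by the caller once; reusable across calls. */
    struct SMgrRelationData *smgr;
    ForkNumber  forknum;
    BufferAccessStrategy strategy;
    int         flags;

    /* Everything below here is private to bufmgr.c. */
    ...
} ReadBuffersOperation;

extern bool StartReadBuffers(ReadBuffersOperation *operation,
                             Buffer *buffers,
                             BlockNumber blocknum,
                             int *nblocks);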

> In the attached patch set, the first three patches are your v9 with no
> changes. The last patch refactors away 'buffer_io_indexes' like I
> mentioned above. The others are fixes for some other trivial things that
> caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better
name for the GUC.  We came up with "io_combine_limit".  It's easier to
document a general purpose limit than to explain what "buffer_io_size"
does (particularly since the set of affected features will grow over
time but starts so small).  I'll feel better about using it to control
bulk writes too, with that name.

I collapsed the various memory allocations into one palloc.  The
buffer array is now a buffers[FLEXIBLE_ARRAY_MEMBER].
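
The allocation now looks roughly like this (a sketch; the real field
names and size computation, including the space for per-buffer data,
are in the patch):

typedef struct ReadStream
{
    int16       max_pinned_buffers;
    /* ... other bookkeeping, callback pointer, per-buffer data ... */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER];
} ReadStream;

    stream = (ReadStream *)
        palloc0(offsetof(ReadStream, buffers) +
                sizeof(Buffer) * queue_size);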

I got rid of "finished" (now represented by distance == 0; I was
removing branches and variables).  I also got rid of "started", which
can now be deduced; it was used to suppress advice when you call
_next() because you need a block and we have to read it immediately.
See the function argument suppress_advice.

Here is a new proposal for the names, updated in v10:

read_stream_begin_relation()
read_stream_next_buffer()
read_stream_end()
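
To make those concrete, a consumer looks roughly like this.  The
begin() argument list, the READ_STREAM_FULL flag and the callback
signature are simplified for illustration; see the patch for the
exact interface.  BlockRangeData is a made-up caller struct.

static BlockNumber
block_range_callback(ReadStream *stream, void *callback_private_data,
                     void *per_buffer_data)
{
    BlockRangeData *range = callback_private_data;

    if (range->current >= range->last)
        return InvalidBlockNumber;  /* end of stream */
    return range->current++;
}

    ...
    stream = read_stream_begin_relation(READ_STREAM_FULL,
                                        NULL,   /* no strategy */
                                        rel, MAIN_FORKNUM,
                                        block_range_callback, &range,
                                        0);     /* no per-buffer data */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... use the pinned, already-read buffer ... */
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);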

I think we'll finish up with different 'constructor' functions for
different kinds of streams.  For example I already want one that can
provide a multi-relation callback for use by recovery (shown in v1).
Others might exist for raw file access, etc.  The defining
characteristic of this one is that it accesses one specific
relation/fork.  Well, _relfork() might be more accurate but less easy
on the eye.  I won't be surprised if people not following this thread
have ideas after commit; it's certainly happened before that something
gets renamed in beta, and I won't mind a bit if that happens...

I fixed a thinko in the new ReadBuffer() implementation mentioned
before (thanks to Bilal for pointing this out): it didn't handle the
RBM_ZERO_XXX flags properly.  Well, it did work and the tests passed,
but it performed a useless read first.  I probably got mixed up when I
removed the extended interface: it was capable of streaming zeroed
buffers, but this simple one isn't, and it wasn't suppressing the read
as it would need to.  Later I'll propose to add that back in for
recovery.

I fixed a recently added thinko in the circular queue: I mistakenly
thought I didn't need a spare gap between head and tail anymore,
because now we never compare them (we track the number of pinned
buffers instead), but after read_stream_next_buffer() returns, users
of per-buffer data need that data to remain valid until the next call.
So the recent refactoring work didn't survive contact with Melanie's
BHS work, which uses per-buffer data.  We need to "wipe" the previous
(spare) entry, which can't possibly be in use yet, and if anyone ever
sees 0x7f in their per-buffer data, it means they illegally accessed
the older value after a new call to read_stream_next_buffer().  Fixed,
with comments to clarify.
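
The wipe itself is nothing more than this kind of thing (a generic
sketch with assumed names, not the patch's code): the queue is sized
with one spare entry, and the slot just behind the one being handed
out, which nothing can legally reference anymore and which the
look-ahead can't have reused yet, is filled with 0x7f.

#include <stddef.h>
#include <string.h>

static void
wipe_spare_per_buffer_data(char *per_buffer_data, size_t per_buffer_data_size,
                           int oldest_buffer_index, int queue_size)
{
    int     spare = oldest_buffer_index - 1;

    if (spare < 0)
        spare += queue_size;    /* queue_size = max pinned buffers + 1 spare */

    memset(per_buffer_data + (size_t) spare * per_buffer_data_size,
           0x7f, per_buffer_data_size);
}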

Retesting with Melanie's latest BHS patch set and my random.sql
(upthread) gives the same system call trace as before.  The intention
of that test was to demonstrate exactly which syscall sequences
different effective_io_concurrency values produce.  Now you can also
run it with different values of the new io_combine_limit.  If you run
it with io_combine_limit = '8kB', it looks almost exactly like master,
i.e. no big reads are allowed; the only difference is that
read_stream.c refuses to issue advice for strictly sequential reads:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(84,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(84,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(84,0xb0000,0x2000,...)
posix_fadvise(93,0xb0000,0x2000,...)   pread(84,...,8192,0xb0000) = 8192
pread(93,...,8192,0xb0000) = 8192      pread(84,...,8192,0xb2000) = 8192
posix_fadvise(93,0xb2000,0x2000,...)   posix_fadvise(84,0x108000,0x2000,...)
pread(93,...,8192,0xb2000) = 8192      pread(84,...,8192,0x108000) = 8192
posix_fadvise(93,0x108000,0x2000,...)  pread(84,...,8192,0x10a000) = 8192

You wouldn't normally see that, though, as the default io_combine_limit
would just merge those adjacent reads after the first one triggers
ramp-up towards behaviour C:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(80,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(80,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(80,0xb0000,0x4000,...)
posix_fadvise(93,0xb0000,0x2000,...)   preadv(80,...,2,0xb0000) = 16384
pread(93,...,8192,0xb0000) = 8192      posix_fadvise(80,0x108000,0x4000,...)
posix_fadvise(93,0xb2000,0x2000,...)   preadv(80,...,2,0x108000) = 16384
pread(93,...,8192,0xb2000) = 8192      posix_fadvise(80,0x160000,0x4000,...)

I spent most of the past few days trying to regain some lost
performance.  Thanks to Andres for some key observations and help!
That began with reports from Bilal and Melanie (possibly related to
things Tomas had seen too, not sure) of regressions in all-cached
workloads, which I already improved a bit with the ABC algorithm that
minimised pinning for this case.  That is, if there's no recent I/O,
so that we reach what I call behaviour A, it should try to do as
little magic as possible.  But it turns out that wasn't enough!  It is
very hard to beat a tight loop that just does ReadBuffer(),
ReleaseBuffer() over millions of already-cached blocks if you have to
do exactly the same work AND run extra management instructions.

There were two layers to the solution in this new version.  First, I
now have a special case in read_stream_next_buffer(), a sort of
open-coded specialisation for behaviour A with no per-buffer data, and
that got rid of most of the regression, but some remained.  Next,
Andres pointed out that ReadBuffer() itself, even though it is now
implemented on top of StartReadBuffers(nblocks = 1), was still beating
my special case code that calls StartReadBuffers(nblocks = 1): the
calls look about the same, but bufmgr.c was able to inline and
specialise its own call for one block.  To give read_stream.c that
power from its home in another translation unit, we needed to export a
special-case singular StartReadBuffer() (investigation and patch by
Andres, added as co-author).  It just calls the plural function with
nblocks = 1, but it gets inlined.  So now the special case for
behaviour A drills through both layers, and hopefully there is no
measurable regression now; I need to test a bit more.  Of course we
can't *beat* the old code in this case, yet, but...
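
For readers following along, the new singular entry point is nothing
more exotic than this (a sketch; the argument list is simplified to
match the earlier struct sketch rather than copied from the patch):

/*
 * Single-block convenience entry point: it just calls the plural
 * function with nblocks = 1.  Because both live in bufmgr.c, the
 * compiler can inline and specialise the one-block path, which
 * read_stream.c could not get by making the same call from another
 * translation unit.
 */
bool
StartReadBuffer(ReadBuffersOperation *operation, Buffer *buffer,
                BlockNumber blocknum)
{
    int         nblocks = 1;

    return StartReadBuffers(operation, buffer, blocknum, &nblocks);
}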

(We speculate that a future tree-based buffer mapping table might
allow efficient lookup for a range of block numbers in one go, and
then it could be worth paying the book-keeping costs to find ranges.
Perhaps behaviour A and the associated special case code could then be
deleted, as you'd probably want to use multi-block magic all the time,
both for I/O and for mapping table lookups.  Or something like that?)

Attachment: v10-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data

Attachment: v10-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data

Attachment: v10-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data
