Re: Show WAL write and fsync stats in pg_stat_io

Nazir Bilal Yavuz Thu, 11 Jan 2024 03:19:20 -0800

Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > I have code review feedback as well, but I've saved that for my next email.
>
> Ah, cool.
>
> > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavu...@gmail.com> wrote:
> >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <mich...@paquier.xyz> wrote:
> >> Oh, I understand it now. Yes, that makes sense.
> >> I thought removing op_bytes completely ( as you said "This patch
> >> extends it with two more operation sizes, and there are even cases
> >> where it may be a variable" ) from pg_stat_io view then adding
> >> something like {read | write | extend}_bytes and {read | write |
> >> extend}_calls could be better, so that we don't lose any information.
> >
> > Upthread, Michael says:
> >
> >> I find the use of 1 in this context a bit confusing, because when
> >> referring to a counter at N, then it can be understood as doing N
> >> times a operation,
> >
> > I didn't understand this argument, so I'm not sure if I agree or
> > disagree with it.
>
> Nazir has mentioned upthread one thing: what should we do for the case
> where a combination of (io_object,io_context) does I/O with a
> *variable* op_bytes, because that may be the case for the WAL
> receiver?  For this case, he has mentioned that we should set op_bytes
> to 1, but that's something I find confusing because it would mean that
> we are doing read, writes or extends 1 byte at a time.  My suggestion
> would be to use op_bytes = -1 or NULL for the variable case instead,
> with reads, writes and extends referring to a number of bytes rather
> than a number of operations.


I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

> > I think these are the three proposals for handling WAL reads:
> >
> > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > number of calls to pg_pread() or similar
> > 3) setting op_bytes to NULL and the number of reads is the number of
> > calls to pg_pread() or similar
>
> 3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

> > Looking at the patch, I think it is still doing 2.
>
> The patch disables stats for the WAL receiver, while the startup
> process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> discard the variable case.
>
> > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > all those IOOps for which it would be relevant. It kind of stinks but
> > it would give us the freedom to document exactly what a single IOOp
> > means for each combination of BackendType, IOContext, IOObject, and
> > IOOp (as relevant) and still have an accurate number in the *bytes
> > columns. Everyone will probably hate us if we do that, though.
> > Especially because having bytes for the existing IOObjects is an
> > existing feature.
>
> An issue I have with this one is that having multiple tuples for
> each (object,context) if they have multiple op_bytes leads to
> potentially a lot of bloat in the view.  That would be up to 8k extra
> tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

> > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > directly). Now, WAL reads are not from permanent storage anymore. Are
> > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > question for some of the WAL receiver functions. Should we track any
> > I/O other than permanent storage I/O? Or did I miss this being
> > addressed upthread?
>
> That's a good point.  I guess that this should just be a different
> IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

I think different IOContext rather than IOOp suits better for this.

> > In terms of what I/O we should track in a streaming/asynchronous
> > world, the options would be:
> >
> > 1) track read/write syscalls
> > 2) track blocks of BLCKSZ submitted to the kernel
> > 3) track bytes submitted to the kernel
> > 4) track merged I/Os (after doing any merging in the application)
> >
> > I think the debate was largely between 2 and 4. There was some
> > disagreement, but I think we landed on 2 because there is merging that
> > can happen at many levels in the storage stack (even the storage
> > controller). Distinguishing between whether or not Postgres submitted
> > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > but I think it might be confusing for the Postgres user trying to
> > determine why their query is slow. It probably makes the most sense to
> > still track in block size.
> >
> > No matter what solution we pick, you should get a correct number if
> > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > rather, there should be some way of getting an accurate number in
> > bytes of the amount of a particular kind of I/O that has been done.
>
> Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> writes or extends are counted as bytes, because we don't have a fixed
> operation size for some (object,context) cases.

Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
IOs and 2 for the other IOs.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

Reply via email to