Hi, On Fri, Jun 12, 2026 at 06:01:45PM +0200, Peter Eisentraut wrote: > There are several places where the return value of pg_pread() or pg_pwrite() > is passed directly as the byte count to pgstat_count_io_op_time(). The > bytes argument of pgstat_count_io_op_time() is of type uint64, and so error > returns of -1 are going to passed as UINT64_MAX and added as such to the > internal statistics.
Nice catch! > In the attached patch, I have marked up those places. I agree with those places and did not find others. > I think the correction here would be to move the pgstat_count_io_op_time() > calls to after the error returns are handled. This is effectively how most > other code already behaves. For example, most smgr calls don't return on > error, so you don't get a chance to make any pgstat calls afterwards. It's > only the open-coded places where we can even do that. Sounds reasonable to me and done that way in the attached. > However, XLogPageRead() even goes out of its way to make an explicit > pgstat_count_io_op_time() call in the error branch. I suppose this could be > useful to record short reads, but a) this particular instance is still > faulty regarding -1, and b) other places don't do that. So it's a bit > unclear what the preferred behavior on error should be. What about keeping the intent (record short reads) by discarding r <= 0? (done in the attached). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From e0e42fd722b0be07ab8f59e340013a89e5794734 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Mon, 15 Jun 2026 05:44:08 +0000 Subject: [PATCH v1] Fix pgstat_count_io_op_time() calls passing error returns as bytes Several places pass the return value of pg_pread() or pg_pwrite() directly as the bytes argument to pgstat_count_io_op_time(). Since the bytes parameter is of type uint64, error returns of -1 are implicitly cast to UINT64_MAX, corrupting the I/O statistics byte counters in pg_stat_io. Fix this by moving the pgstat_count_io_op_time() calls to after the error handling, so they are only reached with a valid positive byte count. This is consistent with how most other code already behaves: smgr calls don't return on error, so pgstat calls are implicitly only reached on success. Reported-by: Peter Eisentraut <[email protected]> Author: Bertrand Drouvot <[email protected]> Discussion: https://postgr.es/m/[email protected] --- src/backend/access/transam/xlog.c | 6 +++--- src/backend/access/transam/xlogreader.c | 8 +++++--- src/backend/access/transam/xlogrecovery.c | 6 ++++-- src/backend/replication/walreceiver.c | 6 +++--- 4 files changed, 15 insertions(+), 11 deletions(-) 99.6% src/backend/access/transam/ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6c2304fef33..652287da55f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2455,9 +2455,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) written = pg_pwrite(openLogFile, from, nleft, startoffset); pgstat_report_wait_end(); - pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, - IOOP_WRITE, start, 1, written); - if (written <= 0) { char xlogfname[MAXFNAMELEN]; @@ -2475,6 +2472,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m", xlogfname, startoffset, nleft))); } + + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, + IOOP_WRITE, start, 1, written); nleft -= written; from += written; startoffset += written; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 3145c58a9b1..9d64ae34932 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -1597,9 +1597,6 @@ WALRead(XLogReaderState *state, #ifndef FRONTEND pgstat_report_wait_end(); - - pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ, - io_start, 1, readbytes); #endif if (readbytes <= 0) @@ -1612,6 +1609,11 @@ WALRead(XLogReaderState *state, return false; } +#ifndef FRONTEND + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ, + io_start, 1, readbytes); +#endif + /* Update state for read */ recptr += readbytes; nbytes -= readbytes; diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 73b78a83fa7..4d61795b483 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -3390,8 +3390,10 @@ retry: pgstat_report_wait_end(); - pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ, - io_start, 1, r); + /* Count I/O stats only for successful short reads */ + if (r > 0) + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ, + io_start, 1, r); XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size); if (r < 0) diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index d19317703c1..05e2f690fa7 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -954,9 +954,6 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli) byteswritten = pg_pwrite(recvFile, buf, segbytes, (pgoff_t) startoff); pgstat_report_wait_end(); - pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, - IOOP_WRITE, start, 1, byteswritten); - if (byteswritten <= 0) { char xlogfname[MAXFNAMELEN]; @@ -976,6 +973,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli) xlogfname, startoff, segbytes))); } + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, + IOOP_WRITE, start, 1, byteswritten); + /* Update state for write */ recptr += byteswritten; -- 2.34.1
