Hi, Sorry for the slow work on this. The cycle times are humonguous due to valgrind being so slow...
On 2025-04-03 12:40:23 -0700, Noah Misch wrote: > On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote: > > The best fix for that one would, I think, be to have method_io_uring() > > iterate > > over the IOV and mark the relevant regions as defined? That does fix the > > issue at least and does seem to make sense? > > Makes sense. Valgrind knows that read() makes its target bytes "defined". It > probably doesn't have an io_uring equivalent for that. Correct - and I think it would be nontrivial to add, because there's not easy syscall to intercept... > I expect we only need this for local buffers, and it's unclear to me how the > fix for (4a) didn't fix this. At that time I didn't apply the fix in 4a) to local buffers, because local buffers, in HEAD, don't have the valgrind integration. Without that marking the buffer as NOACCESS would cause all sorts of issues, because it'd be considered inaccessible even after pinning. As you analyzed, that then ends up considered undefined due to the MemoryContextAlloc(). > In the general case, we could want client requests as follows: > > - If completor==definer and has not dropped pin: > - Make defined before verifying page. That's all. It might be cleaner to > do this when first retrieving a return value from io_uring, since this > just makes up for what Valgrind already does for readv(). Yea, I think it's better to do that in io_uring. It's what I have done in the attached. > - If completor!=definer or has dropped pin: > - Make NOACCESS in definer when definer cedes its own pin. That's the current behaviour for shared buffers, right? > - For io_method=worker, make UNDEFINED before starting readv(). It might be > cleanest to do this when the worker first acts as the owner of the AIO > subsystem pin, if that's a clear moment earlier than readv(). Hm, what do we need this for? > - Make DEFINED in completor before verifying page. It might be cleaner to > do this when the completor first retrieves a return value from io_uring, > since this just makes up for what Valgrind already does for readv(). I think we can't rely on the marking during retrieving it from io_uring, as that might have happened in a different backend for a temp buffer. That'd only happen if we got io_uring events for *another* IO that involved a shared rel, but it can happen. > > Not quite sure if we should mark > > the entire IOV is efined or just the portion that was actually read - the > > latter is additional fiddly code, and it's not clear it's likely to be > > helpful? > > Seems fine to do the simpler way if that saves fiddly code. Can't quite decide, it's just at the border of what I consider too fiddly... See the change to method_io_uring.c in the attached patch. > > Questions: > > > > 1) It'd be cleaner to implement valgrind support in localbuf.c, so we don't > > need to have special-case logic for that. But it also makes the change > > less > > localized and more "impactful", who knows what kind of skullduggery we > > have > > been getting away with unnoticed. > > > > I haven't written the code up yet, but I don't think it'd be all that > > much > > code to add valgrind support to localbuf. > > It would be the right thing long-term, and it's not a big deal if it causes > some false positives initially. So if you're leaning that way, that's good. It was easy enough. I saw one related failure, FlushRelationBuffers() didn't pin temporary buffers before flushing them. Pinning the buffers fixed that. I don't think it's a real problem to not pin the local buffer during FlushRelationBuffers(), at least not today. But it seems unnecessarily odd to not pin it. I wish valgrind had a way to mark the buffer as inaccessible and then accessible again, without loosing the defined-ness information... Greetings, Andres Freund
>From 472b75150ffaca1d30147f764208341181c31571 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 4 Apr 2025 14:55:30 -0400 Subject: [PATCH v1 1/3] localbuf: Add Valgrind buffer access instrumentation This mirrors 1e0dfd166b3 (+ 46ef520b9566), for temporary table buffers. This is mainly interesting right now because the AIO work currently triggers spurious valgrind errors, and the fix for that is cleaner if temp buffers behave the same as shared buffers. This requires one change beyond the annotations themselves, namely to pin local buffers while writing them out in FlushRelationBuffers(). Reviewed-by: Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6 --- src/backend/storage/buffer/bufmgr.c | 13 +++++++++++++ src/backend/storage/buffer/localbuf.c | 9 +++++++++ 2 files changed, 22 insertions(+) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 1c37d7dfe2f..13cebad9b12 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4895,8 +4895,21 @@ FlushRelationBuffers(Relation rel) errcallback.previous = error_context_stack; error_context_stack = &errcallback; + /* Make sure we can handle the pin */ + ReservePrivateRefCountEntry(); + ResourceOwnerEnlarge(CurrentResourceOwner); + + /* + * Pin/upin mostly to make valgrind work, but it also seems + * like the right thing to do. + */ + PinLocalBuffer(bufHdr, false); + + FlushLocalBuffer(bufHdr, srel); + UnpinLocalBuffer(BufferDescriptorGetBuffer(bufHdr)); + /* Pop the error context stack */ error_context_stack = errcallback.previous; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index ed56202af14..fb44d756dad 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -23,6 +23,7 @@ #include "storage/bufmgr.h" #include "storage/fd.h" #include "utils/guc_hooks.h" +#include "utils/memdebug.h" #include "utils/memutils.h" #include "utils/resowner.h" @@ -183,6 +184,8 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln) instr_time io_start; Page localpage = (char *) LocalBufHdrGetBlock(bufHdr); + Assert(LocalRefCount[-BufferDescriptorGetBuffer(bufHdr) - 1] > 0); + /* * Try to start an I/O operation. There currently are no reasons for * StartLocalBufferIO to return false, so we raise an error in that case. @@ -808,6 +811,9 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount) buf_state += BUF_USAGECOUNT_ONE; } pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); + + /* see comment in PinBuffer() */ + VALGRIND_MAKE_MEM_DEFINED(LocalBufHdrGetBlock(buf_hdr), BLCKSZ); } LocalRefCount[bufid]++; ResourceOwnerRememberBuffer(CurrentResourceOwner, @@ -843,6 +849,9 @@ UnpinLocalBufferNoOwner(Buffer buffer) Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0); buf_state -= BUF_REFCOUNT_ONE; pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); + + /* see comment in UnpinBufferNoOwner */ + VALGRIND_MAKE_MEM_NOACCESS(LocalBufHdrGetBlock(buf_hdr), BLCKSZ); } } -- 2.48.1.76.g4e746b1a31.dirty
>From 31759bafc9a44442f14b6363689bc343f0f9fb86 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 4 Apr 2025 15:03:34 -0400 Subject: [PATCH v1 2/3] aio: Make AIO compatible with valgrind In some edge cases valgrind flags issues with AIO related code. All of the cases addressed in this change are false positives. Most are caused by UnpinBuffer[NoOwner] marking buffer data as inaccessible. This happens even though the AIO subsystem still holds a pin. That's good, there shouldn't be accesses to the buffer outside of AIO related code until it is pinned bu "user" code again. But it requires some explicit work. There is a bit of additional work due to temp tables, as valgrind does not understand io_uring sufficiently to mark buffer contents as defined after a read. Per buildfarm animal skink. Reviewed-by: Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6 --- src/include/storage/aio_internal.h | 1 + src/backend/storage/aio/aio_io.c | 23 +++++++++++ src/backend/storage/aio/method_io_uring.c | 49 ++++++++++++++++++++++- src/backend/storage/aio/method_worker.c | 18 +++++++++ src/backend/storage/buffer/bufmgr.c | 19 +++++++++ 5 files changed, 109 insertions(+), 1 deletion(-) diff --git a/src/include/storage/aio_internal.h b/src/include/storage/aio_internal.h index 7f18da2c856..33f27b9fe50 100644 --- a/src/include/storage/aio_internal.h +++ b/src/include/storage/aio_internal.h @@ -344,6 +344,7 @@ extern PgAioResult pgaio_io_call_complete_local(PgAioHandle *ioh); extern void pgaio_io_perform_synchronously(PgAioHandle *ioh); extern const char *pgaio_io_get_op_name(PgAioHandle *ioh); extern bool pgaio_io_uses_fd(PgAioHandle *ioh, int fd); +extern int pgaio_io_get_iovec_length(PgAioHandle *ioh, struct iovec **iov); /* aio_target.c */ extern bool pgaio_io_can_reopen(PgAioHandle *ioh); diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c index 4d31392ddc7..bd8be987526 100644 --- a/src/backend/storage/aio/aio_io.c +++ b/src/backend/storage/aio/aio_io.c @@ -210,3 +210,26 @@ pgaio_io_uses_fd(PgAioHandle *ioh, int fd) return false; /* silence compiler */ } + +/* + * Return the iovecand its length. Currently only expected to be used by + * debugging infrastructure + */ +int +pgaio_io_get_iovec_length(PgAioHandle *ioh, struct iovec **iov) +{ + Assert(ioh->state >= PGAIO_HS_DEFINED); + + *iov = &pgaio_ctl->iovecs[ioh->iovec_off]; + + switch (ioh->op) + { + case PGAIO_OP_READV: + return ioh->op_data.read.iov_length; + case PGAIO_OP_WRITEV: + return ioh->op_data.write.iov_length; + default: + pg_unreachable(); + return 0; + } +} diff --git a/src/backend/storage/aio/method_io_uring.c b/src/backend/storage/aio/method_io_uring.c index c719ba2727a..1a90cadfd49 100644 --- a/src/backend/storage/aio/method_io_uring.c +++ b/src/backend/storage/aio/method_io_uring.c @@ -38,6 +38,7 @@ #include "storage/shmem.h" #include "storage/lwlock.h" #include "storage/procnumber.h" +#include "utils/memdebug.h" #include "utils/wait_event.h" @@ -324,6 +325,49 @@ pgaio_uring_completion_error_callback(void *arg) errcontext("completing I/O on behalf of process %d", owner_pid); } +static void +pgaio_uring_io_process_completion(PgAioHandle *ioh, int32 res) +{ + /* + * Valgrind does not understand io_uring sufficiently to mark the + * referenced region as defined on IO completion, and it'd probably be + * nontrivial to teach it to do so. So we just have to do the legwork + * ourselves. + */ +#ifdef USE_VALGRIND + if (ioh->op == PGAIO_OP_READV) + { + struct iovec *iov; + uint16 iov_length = pgaio_io_get_iovec_length(ioh, &iov); + int32 processed = 0; + + for (int i = 0; i < iov_length; i++) + { + const char *base = iov[i].iov_base; + int32 len = iov[i].iov_len; + + Assert(iov[i].iov_len <= PG_INT32_MAX); + + if (processed + len <= res) + VALGRIND_MAKE_MEM_DEFINED(base, len); + else if (processed <= res) + { + size_t middle = res - processed; + + VALGRIND_MAKE_MEM_DEFINED(base, middle); + VALGRIND_MAKE_MEM_UNDEFINED(base + middle, len - middle); + } + else + VALGRIND_MAKE_MEM_UNDEFINED(base, len); + + processed += len; + } + } +#endif + + pgaio_io_process_completion(ioh, res); +} + static void pgaio_uring_drain_locked(PgAioUringContext *context) { @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context) for (int i = 0; i < ncqes; i++) { struct io_uring_cqe *cqe = cqes[i]; + int32 res; PgAioHandle *ioh; ioh = io_uring_cqe_get_data(cqe); errcallback.arg = ioh; + res = cqe->res; + io_uring_cqe_seen(&context->io_uring_ring, cqe); - pgaio_io_process_completion(ioh, cqe->res); + pgaio_uring_io_process_completion(ioh, res); errcallback.arg = NULL; } diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c index 31d94ac82c5..232b79be4c4 100644 --- a/src/backend/storage/aio/method_worker.c +++ b/src/backend/storage/aio/method_worker.c @@ -42,6 +42,7 @@ #include "storage/latch.h" #include "storage/proc.h" #include "tcop/tcopprot.h" +#include "utils/memdebug.h" #include "utils/ps_status.h" #include "utils/wait_event.h" @@ -529,6 +530,23 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len) error_errno = 0; error_ioh = NULL; + /* + * As part of IO completion the buffer will be marked as + * non-accessible, until the buffer is pinned again. The next time + * there is IO for the same buffer, the memory will be considered + * inaccessible. Therefore we need to explicitly allow access to + * the memory before reading data into it. + */ +#ifdef USE_VALGRIND + { + struct iovec *iov; + uint16 iov_length = pgaio_io_get_iovec_length(ioh, &iov); + + for (int i = 0; i < iov_length; i++) + VALGRIND_MAKE_MEM_UNDEFINED(iov[i].iov_base, iov[i].iov_len); + } +#endif + /* * We don't expect this to ever fail with ERROR or FATAL, no need * to keep error_ioh set to the IO. diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 13cebad9b12..d34a1e335e5 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -6883,6 +6883,19 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer, { PgAioResult result_one; + /* + * If the buffer is not currently pinned by this backend, e.g. because + * we're completing this IO after an error, the buffer data will have + * been marked as inaccessible when the buffer was unpinned. The AIO + * subystem holds a pin, but that doesn't prevent the buffer from + * having been marked as inaccessible. The completion might also be + * executed in a different process. + */ +#ifdef USE_VALGRIND + if (!BufferIsPinned(buffer)) + VALGRIND_MAKE_MEM_DEFINED(bufdata, BLCKSZ); +#endif + if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags, failed_checksum)) { @@ -6901,6 +6914,12 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer, else if (*failed_checksum) *ignored_checksum = true; + /* undo what we did above */ +#ifdef USE_VALGRIND + if (!BufferIsPinned(buffer)) + VALGRIND_MAKE_MEM_NOACCESS(bufdata, BLCKSZ); +#endif + /* * Immediately log a message about the invalid page, but only to the * server log. The reason to do so immediately is that this may be -- 2.48.1.76.g4e746b1a31.dirty
>From 8b059ca837dbbf77344b6a45cd66f0c0935769fc Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 4 Apr 2025 15:15:39 -0400 Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning PgAioResult.result is never accessed in the relevant path, but coverity complains about an uninitialized access anyway. So just zero-initialize the whole thing. While at it, reduce the scope of the variable. Reported-by: Ranier Vilela <ranier...@gmail.com> Discussion: https://postgr.es/m/caeudqapskqd-s+fsuq0omxjamhmbsxxraz3dcs+uvqb3irt...@mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index d34a1e335e5..8265bf9fd10 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -6881,8 +6881,6 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer, /* Check for garbage data. */ if (!failed) { - PgAioResult result_one; - /* * If the buffer is not currently pinned by this backend, e.g. because * we're completing this IO after an error, the buffer data will have @@ -6936,6 +6934,8 @@ buffer_readv_complete_one(PgAioTargetData *td, uint8 buf_off, Buffer buffer, */ if (*buffer_invalid || *failed_checksum || *zeroed_buffer) { + PgAioResult result_one = {0}; + buffer_readv_encode_error(&result_one, is_temp, *zeroed_buffer, *ignored_checksum, -- 2.48.1.76.g4e746b1a31.dirty