On 12/06/2025 4:57 pm, Andres Freund wrote:
> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend".  It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that, the
> issuer didn't ensure that no outdated version of ->distilled_result could be
> used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
>
> This also explains why the issue looked so weird - eventually, after fprintfs,
> after a core dump, etc, the updated ->distilled_result result would "arrive"
> in the issuing process, and suddenly look correct.
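
If I write down that ordering as code, it would look roughly like this (just a sketch using PostgreSQL's pg_write_barrier()/pg_read_barrier(); simplified, not the actual aio code):

    /* IO worker, after the IO has completed: */
    ioh->distilled_result = result;              /* 1) publish the result        */
    pg_write_barrier();                          /* 2) order result before state */
    ioh->state = PGAIO_HS_COMPLETED_SHARED;      /* 3) announce completion       */

    /* issuer of the IO: */
    if (ioh->state == PGAIO_HS_COMPLETED_SHARED) /* 4) observe completion        */
    {
        pg_read_barrier();                       /* the missing barrier: without */
                                                 /* it, step 5 may read a stale  */
                                                 /* value                        */
        result = ioh->distilled_result;          /* 5) consume the result        */
    }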

Sorry, I realize that I do not completely understand how this can explain the assertion failure in `pgaio_io_before_start`:

    Assert(ioh->op == PGAIO_OP_INVALID);

I am not so familiar with the AIO implementation in Postgres, so please excuse me if I am missing something...

Ok, `distilled_result` contains some stale data.
As far as I understand, it is used for calling the completion callbacks.

Right now in the debugger (with the core file loaded), I see that two callbacks are registered: `aio_md_readv_cb` and `aio_shared_buffer_readv_cb`.
Neither of them changes the "op" field.

This is the full content of the handle:

(PgAioHandle) {
  state = PGAIO_HS_HANDED_OUT
  target = PGAIO_TID_SMGR
  op = PGAIO_OP_READV
  flags = '\x04'
  num_callbacks = '\x02'
  callbacks = {
    [0] = '\x02'
    [1] = '\x01'
    [2] = '\0'
    [3] = '\0'
  }
  callbacks_data = {
    [0] = '\0'
    [1] = '\0'
    [2] = '\0'
    [3] = '\0'
  }
  handle_data_len = '\x01'
  owner_procno = 28
  result = 0
  node = {
    prev = 0x000000010c9ea368
    next = 0x000000010c9ea368
  }
  resowner = 0x000000014880ef28
  resowner_node = {
    prev = 0x000000014880f1d8
    next = 0x000000014880f1d8
  }
  generation = 133159566
  cv = {
    mutex = 0
    wakeup = (head = -1, tail = -1)
  }
  distilled_result = (id = 0, status = 0, error_data = 0, result = 1)
  iovec_off = 448
  report_return = 0x000000010f81b950
  op_data = {
    read = (fd = 7, iov_length = 1, offset = 180936704)
    write = (fd = 7, iov_length = 1, offset = 180936704)
  }
  target_data = {
    smgr = {
      rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384)
      blockNum = 22088
      nblocks = 1
      forkNum = MAIN_FORKNUM
      is_temp = false
      skip_fsync = false
    }
  }
}


Actually `PGAIO_OP_READV` is set in only one place: `pgaio_io_start_readv`.
And this function is obviously called in the user's backend with procno=28, which later hits this assertion failure.

So I tried to understand the dataflow.

1. The backend calls `pgaio_io_start_readv` to initiate the IO. It assigns `op=PGAIO_OP_READV`.
2. This becomes visible to all CPUs because of the write barrier in `pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)` called from `pgaio_io_stage`.
3. The AIO worker gets a notification about this request, handles it, and changes the status to `COMPLETED_SHARED` in `pgaio_io_process_completion`, which notifies the backend using a condvar (which is expected to enforce a write barrier).
4. The backend sees that the status is `COMPLETED_SHARED`. It can do this in `pgaio_io_wait` or in `pgaio_io_wait_for_free`. In the first case there seems to be no problem, because `pgaio_io_was_recycled` is used, which does a read barrier. So you suspect `pgaio_io_wait_for_free`, where the read barrier was missing (before your patch). But it is not quite clear to me what can happen to the "op" field without a read barrier (see the sketch after this list). In theory the backend can see an old value. In practice it seems to be impossible, because as far as I know the unit of syncing data between CPUs is a cache line.
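
Schematically, the difference between the two wait paths as I understand it (pseudo-code, simplified, not the real functions):

    /* pgaio_io_wait path: the state is read via pgaio_io_was_recycled(),
     * which performs a read barrier, so handle fields read afterwards
     * (->distilled_result, ->op, ...) cannot be stale. */
    state = ioh->state;
    pg_read_barrier();
    /* ... safe to look at the other handle fields here ... */

    /* pgaio_io_wait_for_free path before the patch: the state is checked,
     * but nothing orders the subsequent reads, so on a weakly ordered CPU
     * they may still return old values. */
    if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
    {
        /* reads of ->distilled_result, ->op, ... may be stale here */
    }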

So my hypothesis was the following: the backend observed an old value of "op" (PGAIO_OP_READV, because that is what it had written itself), then overwrote it with PGAIO_OP_INVALID in `pgaio_io_reclaim`, and then the delayed update from the AIO worker arrived and restored the old value - PGAIO_OP_READV.
But this does not seem to be a very realistic scenario.

Can you please explain how, according to your theory, an out-of-date version of distilled_result can explain such behavior? It would be possible if the result contained the index of the handle and a stale version of that index caused the wrong IO handle to be addressed. But that is not the case - as we can see from the trace, the address of `ioh` in the backend is always the same.





>> This is what I am observing myself: the "op" field is modified and fetched
>> by the same process.
>
> Right - but I don't think the ->op field being wrong was actually part of the
> issue though.
>
>
>> Certainly the process can be rescheduled to some other CPU. But if such a
>> reschedule could cause the loss of a stored value, then nothing would work,
>> would it?
>
> Yes, that'd be completely broken - but isn't the issue here.
>
> Greetings,
>
> Andres Freund

