Hello Tomas,

01.09.2023 16:00, Tomas Vondra wrote:
> Hmmm, I'm not very good at reading the binary code, but here's what
> objdump produced for WaitEventSetWait. Maybe someone will see what the
> issue is.

At first glance, I can't see anything suspicious in the disassembly.
IIUC, waiting = true is represented there as:
  805c38: b902ad18      str     w24, [x8, #684] // pgstat_report_wait_start(): proc->wait_event_info = wait_event_info;
// end of pgstat_report_wait_start(wait_event_info);

  805c3c: b0ffdb09      adrp    x9, 0x366000 <dsm_segment_address+0x24>
  805c40: b0ffdb0a      adrp    x10, 0x366000 <dsm_segment_address+0x28>
  805c44: f0000eeb      adrp    x11, 0x9e4000 <PMSignalShmemInit+0x4>

  805c48: 52800028      mov     w8, #1 // true
  805c4c: 52800319      mov     w25, #24
  805c50: 5280073a      mov     w26, #57
  805c54: fd446128      ldr     d8, [x9, #2240]
  805c58: 90000d7b      adrp    x27, 0x9b1000 <ModifyWaitEvent+0xb0>
  805c5c: fd415949      ldr     d9, [x10, #688]
  805c60: f9071d68      str     x8, [x11, #3640] // waiting = true (x8 = w8)

So there are two simple mov instructions and two load operations performed
in parallel, but I don't think it's similar to what we had in that case.

> I thought about maybe just adding the barrier in the code, but then how
> would we know it's the issue and this fixed it? It happens so rarely we
> can't make any conclusions from a couple runs of tests.

I could probably construct a reproducer for the lockup if I had access to
such a machine for a day or two.
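
For reference, the barrier idea mentioned above could be sketched roughly as
below. This is only an illustration, not the actual latch.c code: the variable
wait_event_info_slot and the function begin_wait() are hypothetical stand-ins
for the two stores annotated in the disassembly, and the C11
atomic_thread_fence() call is used here in place of whatever barrier primitive
the tree would really use (presumably pg_memory_barrier()).

```c
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for proc->wait_event_info. */
static volatile uint32_t wait_event_info_slot = 0;

/* Stand-in for the flag stored by "waiting = true" in the disassembly. */
static volatile sig_atomic_t waiting = false;

/*
 * Hypothetical helper: perform the two stores in the same order as in the
 * disassembly, with a full memory fence between them so that neither the
 * compiler nor the CPU may reorder the second store before the first.
 */
static void
begin_wait(uint32_t wait_event_info)
{
    wait_event_info_slot = wait_event_info;     /* first store */

    /* Assumption: a sequentially consistent fence models the barrier. */
    atomic_thread_fence(memory_order_seq_cst);

    waiting = true;                             /* second store */
}
```

Of course, as noted in the quoted text, such a change by itself would not show
whether reordering is the cause; that still needs a reproducer.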

Best regards,
Alexander

