On 2025/05/14 9:51, 'Paolo Bonzini' via devel wrote:
On 5/11/25 08:08, Akihiko Odaki wrote:
@@ -387,12 +365,17 @@ void qemu_event_set(QemuEvent *ev)
      assert(ev->initialized);
      /*
-     * Pairs with both qemu_event_reset() and qemu_event_wait().
+     * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
       *
       * qemu_event_set has release semantics, but because it *loads*
       * ev->value we need a full memory barrier here.
+     *
+     * qemu_event_wait() does not have a paired barrier on non-Linux because
+     * ev->lock ensures ordering.
       */

Thanks for trying to write down the idea.  This works, but there's
still room for improvement. :)  The insight is that the logic becomes:

   if (done == 0) {                  done = 1;
     <read ev->value>                ev->value = EV_SET;
     ev->value = EV_FREE;
     if (done == 0) {
       pthread_cond_wait();          pthread_cond_broadcast();
     }
   }

which is a lot simplified compared to the futex case because all the
differences between EV_FREE and EV_BUSY are replaced by the condition
variable.  In order to avoid the slow path, all that's needed is to
ensure that if qemu_event_reset() sees EV_SET, the caller also sees
done == 1 in the second "if".

The futex case needs qatomic_or() because there can be a concurrent
EV_FREE->EV_BUSY, and a smp_mb() because qemu_event_set() needs to
*read* ev->value.  The non-futex case, instead, can do this:

   if (done == 0) {                    done = 1;
     v = load_acquire(ev->value); <--- store_release(value, EV_SET);
     ev->value = v | EV_FREE;
     if (done == 0) {
       pthread_cond_wait();          pthread_cond_broadcast();
     }
   }

where load<---store indicates that the load on the left reads the
value that the store writes, and the store "happens before" the load.
In other words:

- because v reads EV_SET, everything before the store_release is ordered
before everything that follows the load_acquire

- therefore if v reads EV_SET, the pthread_cond_wait() won't be reached
and there's no possibility of a hang (I think :))


This becomes the following patch:

diff --git a/util/event.c b/util/event.c
index 20853d61a7c..489cec08de7 100644
--- a/util/event.c
+++ b/util/event.c
@@ -47,18 +47,14 @@ void qemu_event_set(QemuEvent *ev)
  {
      assert(ev->initialized);

+#ifdef HAVE_FUTEX
      /*
-     * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
-     *
+     * Pairs with qemu_event_wait() and qemu_event_reset().
       * qemu_event_set has release semantics, but because it *loads*
       * ev->value we need a full memory barrier here.
-     *
-     * qemu_event_wait() does not have a paired barrier on non-Linux because
-     * ev->lock ensures ordering.
       */
      smp_mb();

-#ifdef HAVE_FUTEX
      if (qatomic_read(&ev->value) != EV_SET) {
          int old = qatomic_xchg(&ev->value, EV_SET);

@@ -71,7 +67,8 @@ void qemu_event_set(QemuEvent *ev)
      }
  #else
      pthread_mutex_lock(&ev->lock);
-    qatomic_set(&ev->value, EV_SET);
+    /* Pairs with qemu_event_reset()'s load acquire.  */
+    qatomic_store_release(&ev->value, EV_SET);
      pthread_cond_broadcast(&ev->cond);
      pthread_mutex_unlock(&ev->lock);
  #endif
@@ -81,6 +78,7 @@ void qemu_event_reset(QemuEvent *ev)
  {
      assert(ev->initialized);

+#ifdef HAVE_FUTEX
      /*
       * If there was a concurrent reset (or even reset+wait),
       * do nothing.  Otherwise change EV_SET->EV_FREE.
@@ -92,6 +90,28 @@ void qemu_event_reset(QemuEvent *ev)
       * Pairs with the first memory barrier in qemu_event_set().
       */
      smp_mb__after_rmw();
+#else
+    /*
+     * If futexes are not available, there are no EV_FREE->EV_BUSY
+     * transitions because wakeups are done entirely through the
+     * condition variable.  Since qatomic_set() always writes EV_FREE
+     * the load seems useless, but in reality its acquire synchronizes
+     * with qemu_event_set()'s store-release: if qemu_event_reset()
+     * sees EV_SET here, then the caller will certainly see a
+     * successful condition and skip qemu_event_wait():
+     *
+     * done = 1;                 if (done == 0)
+     * qemu_event_set() {          qemu_event_reset() {
+     *   lock();
+     *   ev->value = EV_SET ----->     v = ev->value
+     *                                 ev->value = v | EV_FREE
+     *   cond_broadcast()
+     *   unlock();                 }
+     * }                           if (done == 0)
+     *                               // qemu_event_wait() not called
+     */
+    qatomic_set(&ev->value, qatomic_load_acquire(&ev->value) | EV_FREE);
+#endif
  }

  void qemu_event_wait(QemuEvent *ev)


You don't need to post v4 (which would include the above diff in
this patch, in qemu-thread-posix.c); I can post the above change
as a follow up too.  What you have now is not wrong, it's only the
explanation that isn't too precise.

Honestly, I do not understand why smp_mb() is placed here even in the futex case. The comment says:
> qemu_event_set has release semantics, but because it *loads*
> ev->value we need a full memory barrier here.

The barrier is indeed followed by a load: qatomic_read(&ev->value) != EV_SET
However, the caller of qemu_event_set() should not care whether the event is already set or not, so I'm not sure if smp_mb() is needed in the first place.

Perhaps what is missing here is clear documentation of the concurrency semantics of these functions. In my understanding, these functions need to satisfy the following semantics:

qemu_event_set(): release *if the event is not already set*; otherwise it has no effect on synchronization so we don't need a barrier either.

qemu_event_reset(): acquire; this enables checking the state for the event set before resetting.

qemu_event_wait(): acquire

Regards,
Akihiko Odaki
