On Sat, Nov 15, 2025 at 10:58:11AM +0900, Akihiko Odaki wrote:
> On 2025/11/15 0:30, Peter Xu wrote:
> > On Fri, Nov 14, 2025 at 10:24:40AM +0900, Akihiko Odaki wrote:
> > > However it creates another problem. Think of the following sequence:
> > > 
> > > call_rcu_thread()              |
> > >                                 | call_rcu1()
> > >                                 |  qatomic_fetch_inc(&rcu_call_count)
> > >                                 |  qemu_sem_post(&sync_event)
> > >                                 |
> > >   qatomic_read(&rcu_call_count) |
> > >   enter_qs(false)               |
> > >    wait_for_readers(false)      |
> > >     qemu_sem_timedwait(         |
> > >      &sync_event, 10)           |
> > > 
> > > qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters the
> > > force quiescent state.
> > 
> > First thing to mention is, IIUC the above can't happen, because if
> > call_rcu_thread() has already been woken up and reached enter_qs(), then
> > there should have been, for example, a prior call_rcu1() that incremented
> > rcu_call_count and posted to sync_event; hence rcu_call_count cannot be 0
> > anymore in the call_rcu1() above, because the sub happens later:
> > 
> > call_rcu_thread:
> >          ... sees rcu_call_count > 0, quit loop ...
> >          ...
> >          enter_qs(sleep);
> >          qatomic_sub(&rcu_call_count, n); <-------------------------
> >          ...
> > 
> > That means the concurrent call_rcu1() above will not post the sem anymore,
> > because it only posts when rcu_call_count==0.
> Below is an extended version of the sequence:
> 
> call_rcu_thread()                |
>  qatomic_sub(&rcu_call_count, n) |
>  (sets rcu_call_count to 0)      |
>                                  | call_rcu1()
>                                  |  qatomic_fetch_inc(&rcu_call_count)
>                                  |  qemu_sem_post(&sync_event)
>  qatomic_read(&rcu_call_count)   |
>  enter_qs(false)                 |
>   wait_for_readers(false)        |
>   qemu_sem_timedwait(            |
>    &sync_event, 10)              |
> 
> Note that there may not be a qemu_sem_wait(&sync_event) call between
> qatomic_sub(&rcu_call_count, n) and qatomic_read(&rcu_call_count).

Yes, with a semaphore this may happen.

It may also happen that the sem counter keeps accumulating like this (for
example, wait_for_readers() may keep seeing more callbacks registered than
RCU_CALL_MIN_SIZE), so it can grow even beyond 2.  Then, when the RCU
callback storm finishes, call_rcu_thread() may see that >2 sem counter and
loop a few extra times reading rcu_call_count as zero.

I think it's fine, because in extremely rare cases like this the rcu
thread only runs slightly faster on recycling, or spins a few extra loops
in equally rare corner cases (as I mentioned before, I was trying to ask
whether high-frequency RCU frees matter to us; that's why I was trying to
count them over a VM's lifespan).  Hence IMHO it's not an error either, but
a benign false positive.
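
To illustrate, below is a rough sketch of the sem-based loop I have in mind
(simplified shapes only, not code from any series; callback processing is
elided):

    static void *call_rcu_thread(void *opaque)
    {
        for (;;) {
            int n;

            /*
             * Posts accumulate while we keep seeing rcu_call_count > 0
             * and never block; once the storm is over, each leftover post
             * makes this wait return with the count already drained to
             * zero -- the benign false positive above.  Likewise, a
             * leftover post may let a qemu_sem_timedwait() inside
             * wait_for_readers() return early, i.e. recycle slightly
             * faster.
             */
            while ((n = qatomic_read(&rcu_call_count)) == 0) {
                qemu_sem_wait(&sync_event);
            }

            enter_qs(false);   /* wait for a grace period; arg simplified */
            /* ... invoke the n pending callbacks ... */
            qatomic_sub(&rcu_call_count, n);
        }
    }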

> 
> > 
> > Besides, IMHO replacing the event with a sem shouldn't change similar
> > behavior compared to when using events.  Because any spot that can post()
> > concurrently can also do qemu_event_set() concurrently... after all, we
> > only have a few spots resetting the event in the original patch, and after
> > the reset a concurrent qemu_event_set() will re-activate it.
> 
> The sequence I showed is handled correctly with a properly placed
> qemu_event_reset():
> 
> call_rcu_thread()                |
>  qatomic_sub(&rcu_call_count, n) |
>  (sets rcu_call_count to 0)      |
>                                  | call_rcu1()
>                                  |  qatomic_fetch_inc(&rcu_call_count)
>                                  |  qemu_event_set(&sync_event)
>  qatomic_read(&rcu_call_count)   |
>  qemu_event_reset(&sync_event)   |
>  enter_qs(false)                 |
>   wait_for_readers(false)        |
>    qemu_sem_timedwait(           |
>     &sync_event, 10)             |
> 
> Note that a concurrent qemu_event_set() after resetting the event can only
> be triggered by force_rcu(), which is intended to interrupt
> wait_for_readers().

I see your point here.  I agree; I can't think of any way to trigger false
positives like the ones possible when a sem is used.
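
Spelling it out for the archive, the property that saves the event version
is the usual QemuEvent reset-then-check idiom, roughly as below (this is
the generic pattern, not the exact code of your series; the "force" flag is
just a placeholder name):

    /* waiter (rcu thread) */
    qemu_event_reset(&sync_event);
    if (qatomic_read(&rcu_call_count) == 0 && !qatomic_read(&force)) {
        qemu_event_wait(&sync_event);
    }

    /* waker (call_rcu1() or force_rcu()) */
    qatomic_fetch_inc(&rcu_call_count);    /* or: set the "force" flag */
    qemu_event_set(&sync_event);

A set that lands after the reset keeps the event set, so the wait returns
immediately instead of the wakeup being lost, and because the condition is
re-checked after every reset, a stale wakeup cannot be mistaken for pending
work the way a leftover sem post can.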

> 
> > 
> > Sem does introduce possible false positives on events, but as I was trying
> > to justify, a generic VM boot and shutdown should need fewer than 1000 rcu
> > calls, so worst case we loop reading rcu_call_count 1000 times... I also
> > didn't measure how many of those call_rcu1() calls will see N>0 and hence
> > not post() to the sem at all; I believe that will be common.  So the perf
> > overhead should really be rare, IMHO.
> > 
> > Note that I was not requesting you to replace this event with a sem before
> > merging.  Frankly, I think your work is good enough to land.
> > 
> > However I still want to discuss a sem replacement, not only because this is
> > the best place to discuss it (after all, force rcu starts to introduce a
> > genuine second user of the event besides call_rcu1, hence "being able to
> > remember more than 1" will start to make more sense than avoiding resets),
> > but also because it looks cleaner to remove some of the new code force rcu
> > adds.
> 
> How does "being able to remember more than 1" help?

It helps avoid resets and makes the code easier to follow.

The current rcu code, especially after your series lands, will have quite
a few (1) deliberate event resets, and (2) ordering constraints between any
of: event waits, event resets, reading the "force" var, reading the
"rcu_call_count" var, etc.  These things need very careful attention;
missing a spot or getting one wrong may introduce hard-to-debug functional
issues.
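
For example, a hypothetical mis-ordering (not from your series, just to
illustrate the failure mode I worry about):

    n = qatomic_read(&rcu_call_count);  /* reads 0 */
                                        /* call_rcu1(): inc + set lands here */
    qemu_event_reset(&sync_event);      /* wipes out that set... */
    qemu_event_wait(&sync_event);       /* ...so we sleep with work pending */

The reset has to come before the read (or the read has to be redone after
the reset), and similar care applies to every wait/reset/read pair above.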

The sem proposal was mostly about trying to remove almost all of those
complexities.

But I agree the drawback is that a sem may introduce false positives like
the above, so it's not an ideal proposal.  It's just that, as mentioned, it
should be all safe and fine IMHO, unless we care about high-frequency rcu
frees.  As I said, I don't have a strong opinion; I'll leave that to you to
decide.

Thanks,

-- 
Peter Xu

