On 2025/10/20 5:23, Dmitry Osipenko wrote:
On 10/17/25 03:40, Akihiko Odaki wrote:
On 2025/10/17 8:43, Akihiko Odaki wrote:
On 2025/10/17 4:33, Dmitry Osipenko wrote:
On 10/16/25 09:34, Akihiko Odaki wrote:
-        /* Wait for one thread to report a quiescent state and try again.
+        /*
+         * Sleep for a while and try again.
          * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
          * wait too much time.
          *
@@ -133,7 +150,20 @@ static void wait_for_readers(void)
          * rcu_registry_lock is released.
          */
         qemu_mutex_unlock(&rcu_registry_lock);
-        qemu_event_wait(&rcu_gp_event);
+
+        if (forced) {
+            qemu_event_wait(&rcu_gp_event);
+
+            /*
+             * We want to be notified of changes made to rcu_gp_ongoing
+             * while we walk the list.
+             */
+            qemu_event_reset(&rcu_gp_event);
+        } else {
+            g_usleep(10000);
+            sleeps++;

Thanks a lot for this RCU improvement. It indeed removes the hard stalls
with unmapping of virtio-gpu blobs.

Am I understanding correctly that we will potentially hit this
g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
MemoryRegion patches from Alex [1] are still needed to avoid stalls
entirely.

[1] https://lore.kernel.org/qemu-devel/[email protected]/

That is right, but "avoiding stalls entirely" also causes use-after-free.

The problem with virtio-gpu on TCG is that TCG keeps using the old
memory map until force_rcu is triggered. So, without force_rcu, the
following pseudo-code on a guest will result in use-after-free:

void *address = blob_map(resource_id);
blob_unmap(resource_id);

for (size_t i = 0; i < some_big_number; i++) {
    *(uint8_t *)address = 0;
}

*(uint8_t *)address will keep dereferencing the blob until force_rcu is
triggered, so finalizing the MemoryRegion before force_rcu results in
use-after-free.
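
For reference, this is roughly how TCG reacts to force_rcu; the sketch
below is paraphrased from memory of accel/tcg/tcg-accel-ops-mttcg.c and
simplified, so treat it as an illustration rather than the exact code:

#include "qemu/osdep.h"
#include "qemu/notify.h"
#include "qemu/rcu.h"
#include "hw/core/cpu.h"

typedef struct MttcgForceRcuNotifier {
    Notifier notifier;
    CPUState *cpu;
} MttcgForceRcuNotifier;

static void do_nothing(CPUState *cpu, run_on_cpu_data data)
{
}

/* Called by the RCU thread when a grace period is taking too long. */
static void mttcg_force_rcu(Notifier *notify, void *data)
{
    CPUState *cpu = container_of(notify, MttcgForceRcuNotifier,
                                 notifier)->cpu;

    /*
     * Queuing even a no-op work item kicks the vCPU out of the TB
     * execution loop, which ends its RCU read-side critical section
     * and makes it pick up the new memory map on re-entry.
     */
    async_run_on_cpu(cpu, do_nothing, RUN_ON_CPU_NULL);
}

Each vCPU thread registers such a notifier with
rcu_add_force_rcu_notifier() before entering its execution loop.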

The best option I have in mind to eliminate the delay entirely is to
call drain_call_rcu(), but I'm not in favor of such a change (for now).
drain_call_rcu() eliminates the delay if the FlatView protected by RCU
is the only referrer of the MemoryRegion, but that is not guaranteed.
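
For illustration only, a minimal sketch of what that could look like on
the unmap path; drain_call_rcu(), memory_region_del_subregion() and
object_unparent() are real QEMU APIs, but the function below and its
surroundings are made up (and assume QEMU's memory API header):

#include "qemu/osdep.h"
#include "qemu/rcu.h"
/* plus QEMU's memory API header for MemoryRegion */

/* Hypothetical sketch, not the actual virtio-gpu code path. */
static void blob_unmap_sync(MemoryRegion *hostmem, MemoryRegion *region)
{
    memory_region_del_subregion(hostmem, region);

    /*
     * Wait for all RCU callbacks queued so far to run.  If the
     * RCU-protected FlatView held the last reference to the region,
     * it is gone when this returns; if anything else still refers
     * to the region, the delay is not eliminated.
     */
    drain_call_rcu();

    object_unparent(OBJECT(region));
}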

Performance should not be a concern anyway in this situation. The
guest should not waste CPU time by polling in the first place if you
really care about performance; since it's a para-virtualized device and
not real hardware, CPU time may be shared between the guest and the
device, and thus polling on the guest has an inherent risk of slowing
down the device. For performance-sensitive workloads, the guest should
(see the sketch after this list):

- avoid polling and
- accumulate commands instead of waiting for each
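
As a sketch of what I mean, with gpu_queue_cmd(), gpu_submit() and
gpu_wait_fence() being made-up names standing in for whatever the real
guest driver provides:

#include <stddef.h>

struct gpu_queue;
struct gpu_cmd;
void gpu_queue_cmd(struct gpu_queue *q, const struct gpu_cmd *cmd);
void gpu_submit(struct gpu_queue *q);
void gpu_wait_fence(struct gpu_queue *q);

void render_frame(struct gpu_queue *q, const struct gpu_cmd *cmds,
                  size_t n)
{
    size_t i;

    /* Accumulate the whole frame's commands before submitting. */
    for (i = 0; i < n; i++) {
        gpu_queue_cmd(q, &cmds[i]);
    }
    gpu_submit(q);

    /* Sleep on a completion fence instead of spinning on the result. */
    gpu_wait_fence(q);
}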

The delay will be less problematic if the guest behaves this way, and I
think at least Linux does avoid polling.

That said, stalling the guest forever in this situation is "wrong" (!=
"bad performance"). I wrote this patch to guarantee forward progress,
which is mandatory for semantic correctness.

Perhaps drain_call_rcu() may also make sense in other,
performance-sensitive scenarios, but it should be added only after
benchmarking, or we will end up with a premature optimization.

I first thought just adding drain_call_rcu() would work, but apparently
it is not that simple. Adding drain_call_rcu() has a few problems:

- It drops the BQL, which should be avoided. Problems caused by
run_on_cpu(), which drops the BQL, have been discussed on the list a
few times, and drain_call_rcu() may suffer from them as well.

- It is less effective if the RCU thread enters g_usleep() before
drain_call_rcu() is called.

- It slows down readers due to the nature of drain_call_rcu().

So, if you know of a workload that may suffer from the delay, it may be
a good idea to try it with the patches from Alex first, and then think
of a clean solution if that improves performance.

Thanks a lot for the clarification. I'm seeing occasional 10ms stalls
with this patch applied; still, it's a huge improvement. Looking forward
to v2.

Just for (further) clarification: the 10ms stalls are present even
without this patch (correct me if I'm wrong). I think the stalls need to
be resolved with a separate patch instead of a v2 of this one, unless
they are a regression.


In addition to the guest waiting for virgl command completion, QEMU
display updates on the host are also blocked while the unmap command is
suspended. This is a noticeable problem for interactive GFX applications
running on the guest.

I guess you mean that the scanout commands following the unmap commands
are blocked. While I can imagine that causing skipped frames and hurting
the user experience, it would be nice if you could share reproduction
cases or affected workloads with me.

Regards,
Akihiko Odaki
