On 6/1/2026 11:04 PM, Eugenio Perez Martin wrote:
On Tue, Jun 2, 2026 at 6:34 AM yangjiale <[email protected]> wrote:
When a descriptor list spans across cache lines,
updating the flag first can lead to a scenario where the device side
perceives the flag as valid, yet the corresponding address and length
fields have not been updated yet, so the device reads invalid values.
Therefore, the flag field must be updated last.
Signed-off-by: yangjiale <[email protected]>
---
drivers/virtio/virtio_ring.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index fbca7ce1c6bf..036b4f90d30f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
&addr, &len, premapped, attr))
goto unmap_release;
+ desc[i].addr = cpu_to_le64(addr);
+ desc[i].len = cpu_to_le32(len);
+ desc[i].id = cpu_to_le16(id);
+
flags = cpu_to_le16(vq->packed.avail_used_flags |
(++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
(n < out_sgs ? 0 : VRING_DESC_F_WRITE));
@@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
else
desc[i].flags = flags;
- desc[i].addr = cpu_to_le64(addr);
- desc[i].len = cpu_to_le32(len);
- desc[i].id = cpu_to_le16(id);
-
if (unlikely(vq->use_map_api)) {
vq->packed.desc_extra[curr].addr = premapped ?
DMA_MAPPING_ERROR : addr;
These flags are updated before the flags of the head descriptor at the
end of the function, at "vq->packed.vring.desc[head].flags =
head_flags", so the device should not see these. Because of that, the
relative order between the rest of the fields of the same descriptor
or other descriptors' fields, except for the head descriptor's flags,
should not matter. There is a write memory barrier just before
updating the head's flags.
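
To make the ordering argument concrete, here is a minimal userspace sketch, not the kernel code. It assumes a C11 release fence can stand in for the driver's write barrier, that the chain does not wrap around the ring, and that the flag bit values mirror the packed ring layout. The names packed_desc, sg_entry, publish_chain and the DESC_F_* macros are hypothetical stand-ins chosen for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the packed ring flag bits. */
#define DESC_F_NEXT  (1u << 0)
#define DESC_F_AVAIL (1u << 7)

/* Simplified descriptor; only flags is atomic because it is the
 * field the device polls to decide whether a descriptor is valid. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	_Atomic uint16_t flags;
};

struct sg_entry {
	uint64_t addr;
	uint32_t len;
};

/* Fill a chain of n descriptors starting at ring[head].
 * Assumption: head + n does not run past the end of the ring. */
static void publish_chain(struct packed_desc *ring, uint16_t head,
			  const struct sg_entry *sg, size_t n,
			  uint16_t id, uint16_t avail_used)
{
	uint16_t head_flags = 0;

	for (size_t k = 0; k < n; k++) {
		struct packed_desc *d = &ring[head + k];
		uint16_t flags = avail_used |
				 (k + 1 == n ? 0 : DESC_F_NEXT);

		d->addr = sg[k].addr;
		d->len = sg[k].len;
		d->id = id;

		if (k == 0)
			head_flags = flags;	/* deferred, see below */
		else
			atomic_store_explicit(&d->flags, flags,
					      memory_order_relaxed);
	}

	/* Everything written above becomes visible before the head's
	 * flags do; the head flags are what hand the chain over. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ring[head].flags, head_flags,
			      memory_order_relaxed);
}
```

In this sketch the relative order of addr, len, id and the non-head flags inside the loop is irrelevant to a consumer that only starts from the head descriptor, which is the point made above. A consumer that reads ahead past the head would not get that guarantee.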
The above analysis is absolutely correct. Though one hardware vendor
told me that this driver implementation kind of stops them from reading
ahead of descriptors already posted beyond the available index, ending
up with suboptimal performance that is hard to make up by other means.
Would it be a bad idea to go with this change and add a write barrier in a
gentle way for a small flit in the batch, e.g. commit to memory after
every cache line's worth of descriptors is posted? Would the memory
barrier add performance overhead for backend implementation variants
other than a real hardware PCI device?
-Siwei
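
One possible reading of the per-cache-line barrier idea can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not a proposed patch: it assumes a 64-byte cache line, 16-byte descriptors, a C11 release fence in place of the driver's write barrier, and a chain that does not wrap. The names pdesc, post_batched, F_NEXT, F_AVAIL and CACHE_LINE_BYTES are hypothetical, and the groups are counted from the start of the chain rather than aligned to real cache-line boundaries.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define F_NEXT  (1u << 0)
#define F_AVAIL (1u << 7)
#define CACHE_LINE_BYTES 64u	/* assumption, varies by platform */

struct pdesc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	_Atomic uint16_t flags;
};

_Static_assert(sizeof(struct pdesc) == 16, "expected 16-byte descriptor");

#define DESCS_PER_LINE (CACHE_LINE_BYTES / sizeof(struct pdesc))

/* Post n descriptors starting at ring[0]; returns the number of
 * fences issued so the barrier cost is easy to see. */
static unsigned post_batched(struct pdesc *ring, const uint64_t *addrs,
			     const uint32_t *lens, size_t n, uint16_t id)
{
	unsigned fences = 0;

	if (n == 0)
		return 0;

	for (size_t start = 0; start < n; start += DESCS_PER_LINE) {
		size_t end = start + DESCS_PER_LINE < n ?
			     start + DESCS_PER_LINE : n;

		/* Payload fields for one group of descriptors. */
		for (size_t k = start; k < end; k++) {
			ring[k].addr = addrs[k];
			ring[k].len = lens[k];
			ring[k].id = id;
		}

		/* One fence per group, between payload and flags. */
		atomic_thread_fence(memory_order_release);
		fences++;

		for (size_t k = start; k < end; k++) {
			if (k == 0)
				continue;	/* head flags go last */
			atomic_store_explicit(&ring[k].flags,
				F_AVAIL | (k + 1 == n ? 0 : F_NEXT),
				memory_order_relaxed);
		}
	}

	/* Final fence, then the head flags hand over the chain. */
	atomic_thread_fence(memory_order_release);
	fences++;
	atomic_store_explicit(&ring[0].flags,
			      F_AVAIL | (n == 1 ? 0 : F_NEXT),
			      memory_order_relaxed);
	return fences;
}
```

With these assumptions a chain of n descriptors costs one fence per group of four plus the final one before the head flags, instead of a single fence per chain. Whether that extra cost matters for software backends is exactly the open question above.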
Also, I don't get why the cache line matters here. Can you expand? Am
I missing something?