On Fri, Jun 5, 2026 at 8:51 PM Si-Wei Liu <[email protected]> wrote:
>
>
>
> On 6/5/2026 10:43 AM, Michael S. Tsirkin wrote:
> > On Fri, Jun 05, 2026 at 09:03:36AM -0700, Si-Wei Liu wrote:
> >>
> >> On 6/1/2026 11:04 PM, Eugenio Perez Martin wrote:
> >>> On Tue, Jun 2, 2026 at 6:34 AM yangjiale <[email protected]> wrote:
> >>>> When a descriptor list spans across cache lines,
> >>>> updating the flag first can lead to a scenario where the device side
> >>>> perceives the flag as valid, yet the corresponding address and length
> >>>> fields remain unupdated—resulting in invalid values.
> >>>> Therefore, the flag field must be updated last.
> >>>>
> >>>> Signed-off-by: yangjiale <[email protected]>
> >>>> ---
> >>>>    drivers/virtio/virtio_ring.c | 8 ++++----
> >>>>    1 file changed, 4 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>> index fbca7ce1c6bf..036b4f90d30f 100644
> >>>> --- a/drivers/virtio/virtio_ring.c
> >>>> +++ b/drivers/virtio/virtio_ring.c
> >>>> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>>>                                                &addr, &len, premapped, attr))
> >>>>                                   goto unmap_release;
> >>>>
> >>>> +                       desc[i].addr = cpu_to_le64(addr);
> >>>> +                       desc[i].len = cpu_to_le32(len);
> >>>> +                       desc[i].id = cpu_to_le16(id);
> >>>> +
> >>>>                           flags = cpu_to_le16(vq->packed.avail_used_flags |
> >>>>                                       (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> >>>>                                       (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> >>>> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>>>                           else
> >>>>                                   desc[i].flags = flags;
> >>>>
> >>>> -                       desc[i].addr = cpu_to_le64(addr);
> >>>> -                       desc[i].len = cpu_to_le32(len);
> >>>> -                       desc[i].id = cpu_to_le16(id);
> >>>> -
> >>>>                           if (unlikely(vq->use_map_api)) {
> >>>>                                   vq->packed.desc_extra[curr].addr = premapped ?
> >>>>                                           DMA_MAPPING_ERROR : addr;
> >>> These flags are updated before the flags of the head descriptor at the
> >>> end of the function, at "vq->packed.vring.desc[head].flags =
> >>> head_flags", so the device should not see these. Because of that, the
> >>> relative order between the rest of the fields of the same descriptor
> >>> or other descriptors' fields, except for the head descriptor's flags,
> >>> should not matter. There is a write memory barrier just before
> >>> updating the head's flags.
> >> The above analysis is absolutely correct. Though one hardware vendor told me
> >> that this driver implementation kinda stops them from reading ahead of
> >> descriptors already posted beyond the available index, ending up with
> >> suboptimal performance that is hard to make up by other means. Would it be a
> >> bad idea to go with this change and add a write barrier in a gentle way for a
> >> small flit in the batch, e.g. commit to memory after every cache line size
> >> worth of descriptors is posted? Would the memory barrier add performance
> >> overhead for backend implementation variants other than a real hardware
> >> PCI device?
> >>
> >> -Siwei
> > this would need a new feature bit, won't it?
> Probably. This is to capture the device's expectation and behavior,
> right? The driver change itself is not spec violating...
>
> >
> >>> Also, I don't get why the cache line matters here. Can you expand? Am
> >>> I missing something?
> > me too.
> >
> Just to avoid extra delay due to excessive coherency messages and
> frequent cache thrashing: device reads over the PCI bus contend with host
> writes/updates to descriptors in the same cache line.
>

Whether the descriptors are in the same cache line or not, how does
the device know that the memory for the other descriptors is updated
or dirty and needs to be read again? The only way I can imagine is to
force both the device and the driver to update the flags of all
descriptors, regardless of whether they're in a chain, and use that
information for synchronization. Also, the device must read these
flags strictly before the other members, as it does with the head's
flag now. I'm not sure if that memory dance beats the PCI latency.
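
To make the ordering discussed here concrete, below is a rough userspace
sketch in C11, not the kernel code. It assumes no ring wrap-around and
the first lap of the ring; the sketch_* names are made up for
illustration, and the release/acquire pair only stands in for the
driver's write barrier and the device's read ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a packed descriptor: addr, len, id, flags. */
struct sketch_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	_Atomic uint16_t flags;
};

#define SKETCH_F_NEXT  ((uint16_t)(1u << 0))
#define SKETCH_F_AVAIL ((uint16_t)(1u << 7))
#define SKETCH_F_USED  ((uint16_t)(1u << 15))

/*
 * Driver side: fill every descriptor of the chain, then publish the
 * whole chain with a single release store to the head's flags.
 * The order of the stores before that point does not matter to a
 * device that only trusts the head's flags.
 */
static void sketch_publish_chain(struct sketch_desc *ring, uint16_t head,
				 const uint64_t *addrs, const uint32_t *lens,
				 uint16_t n, uint16_t id, uint16_t avail_bits)
{
	uint16_t head_flags = 0;

	assert(n > 0);
	for (uint16_t i = 0; i < n; i++) {
		struct sketch_desc *d = &ring[head + i];
		uint16_t flags = (uint16_t)(avail_bits |
				 (i + 1 == n ? 0 : SKETCH_F_NEXT));

		d->addr = addrs[i];
		d->len = lens[i];
		d->id = id;
		if (i == 0)
			head_flags = flags;	/* deferred until the end */
		else
			atomic_store_explicit(&d->flags, flags,
					      memory_order_relaxed);
	}
	/* Plays the role of the write barrier plus the head flags update. */
	atomic_store_explicit(&ring[head].flags, head_flags,
			      memory_order_release);
}

/*
 * Device side: read the head's flags first (acquire), and only then
 * the other members. Returns 1 if the descriptor is available.
 */
static int sketch_device_peek(struct sketch_desc *d, uint16_t avail_bits,
			      uint64_t *addr, uint32_t *len)
{
	uint16_t flags = atomic_load_explicit(&d->flags, memory_order_acquire);

	if ((flags & (SKETCH_F_AVAIL | SKETCH_F_USED)) != avail_bits)
		return 0;
	*addr = d->addr;
	*len = d->len;
	return 1;
}
```

In this model nothing tells the device that the non-head descriptors are
ready except the head's flags, which is the point above: reading ahead
would need every descriptor's flags to carry that information.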

I thought of something similar, not for cache thrashing but to save
the overhead and latency of the extra PCI read for the length and id
descriptor members. My understanding is that a PCI device can only
read 64 bits atomically at most. So it can fetch the length and id
together with the flags in a single read, saving one fetch, only if
the driver promises to write all of them atomically. This needs a
feature flag, as MST says.

Something like this super early draft:

VIRTIO_F_ATOMIC_64_FLAGS: The driver writes the length, id, and flags
of a packed descriptor atomically, ensuring they are always
synchronized. The device will not read them again once it finds the
descriptor available via its flags.
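
A rough userspace sketch of what the driver and device sides could look
like under such a feature, in C11. Everything here is hypothetical: the
sketch64_* names are made up, a little-endian host is assumed so that
len, id and flags land in the same bytes as in the packed descriptor,
and a C11 atomic store only models the promise. Whether the device
observes it as one 64-bit write depends on the platform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH64_F_AVAIL ((uint16_t)(1u << 7))
#define SKETCH64_F_USED  ((uint16_t)(1u << 15))

/*
 * Hypothetical view of a packed descriptor: the second 8 bytes
 * (len, id, flags) are treated as one naturally aligned 64-bit word.
 */
struct sketch64_desc {
	uint64_t addr;
	_Atomic uint64_t tail;	/* len: bits 0-31, id: 32-47, flags: 48-63 */
};

static_assert(sizeof(struct sketch64_desc) == 16, "descriptor is 16 bytes");

static uint64_t sketch64_pack_tail(uint32_t len, uint16_t id, uint16_t flags)
{
	return (uint64_t)len | ((uint64_t)id << 32) | ((uint64_t)flags << 48);
}

/* Driver side: addr first, then len, id and flags in a single store. */
static void sketch64_post(struct sketch64_desc *d, uint64_t addr,
			  uint32_t len, uint16_t id, uint16_t flags)
{
	d->addr = addr;
	atomic_store_explicit(&d->tail, sketch64_pack_tail(len, id, flags),
			      memory_order_release);
}

/*
 * Device side: one 64-bit read yields flags, id and len together, so
 * len and id need not be fetched again once the flags say "available".
 * The address still needs its own read.
 */
static int sketch64_fetch(struct sketch64_desc *d, uint16_t avail_bits,
			  uint64_t *addr, uint32_t *len, uint16_t *id)
{
	uint64_t tail = atomic_load_explicit(&d->tail, memory_order_acquire);
	uint16_t flags = (uint16_t)(tail >> 48);

	if ((flags & (SKETCH64_F_AVAIL | SKETCH64_F_USED)) != avail_bits)
		return 0;
	*len = (uint32_t)tail;
	*id = (uint16_t)(tail >> 32);
	*addr = d->addr;
	return 1;
}
```

On a big-endian host the packed value would need a byte swap to
little-endian before the store, which is left out of the sketch.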

Conversely, the device could atomically update the descriptor ID along
with the flags, meaning the driver wouldn't need a memory barrier
between these updates.

I guess it does not buy much on x86 software devices, but it might
improve performance on architectures with weaker cache coherency. From
the driver's perspective, the implementation isn't hard.

I also see that PCI has a 128-bit CAS, but I'm not sure if the CPU can
write the 128 bits of the descriptor atomically from the device's POV
or if the driver's write barrier is sufficient. Perhaps this is
actually an improvement for SW devices.

Adding Dragos to the thread.

