On Tue, Apr 09, 2024 at 01:02:52AM +0800, ni.liqiang wrote:
> In our testing of the virtio hardware accelerator, we found that
> configuring the flags of the descriptor after addr and len,
> as implemented in DPDK, seems to be more friendly to the hardware.
> 
> In our Virtio hardware implementation tests, using the default
> open-source code, the hardware's bulk reads ensure performance
> but correctness is compromised. If we refer to the implementation code
> of DPDK, placing the flags configuration of the descriptor
> after addr and len, virtio backend can function properly based on
> our hardware accelerator.
> 
> I am somewhat puzzled by this. From a software process perspective,
> it seems that there should be no difference whether
> the flags configuration of the descriptor is before or after addr and len.
> However, this is not the case according to experimental test results.


You should be aware of the following, from the PCI Express spec.
Note especially the second paragraph, and the last paragraph:

2.4.2.
25
30
Update Ordering and Granularity Observed by a
Read Transaction
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated, the ordering of multiple 
updates and
granularity of each update reflected in the data returned by the read is 
outside the scope of this
specification. This applies both to updates performed by PCI Express write 
transactions and
updates performed by other mechanisms such as host CPUs updating host memory.
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated by one or more entities 
not on the PCI
Express fabric, the ordering of multiple updates and granularity of each update 
reflected in the data
returned by the read is outside the scope of this specification.




As an example of update ordering, assume that the block of data is in host 
memory, and a host CPU
writes first to location A and then to a different location B. A Requester 
reading that data block
with a single read transaction is not guaranteed to observe those updates in 
order. In other words,
the Requester may observe an updated value in location B and an old value in 
location A, regardless
of the placement of locations A and B within the data block. Unless a Completer 
makes its own
guarantees (outside this specification) with respect to update ordering, a 
Requester that relies on
update ordering must observe the update to location B via one read transaction 
before initiating a
subsequent read to location A to return its updated value.




As an example of update granularity, if a host CPU writes a QWORD to host 
memory, a Requester
reading that QWORD from host memory may observe a portion of the QWORD updated 
and
another portion of it containing the old value.
While not required by this specification, it is strongly recommended that host 
platforms guarantee
that when a host CPU writes aligned DWORDs or aligned QWORDs to host memory, 
the update
granularity observed by a PCI Express read will not be smaller than a DWORD.


IMPLEMENTATION NOTE
No Ordering Required Between Cachelines
15
A Root Complex serving as a Completer to a single Memory Read that requests 
multiple cachelines
from host memory is permitted to fetch multiple cachelines concurrently, to 
help facilitate multi-
cacheline completions, subject to Max_Payload_Size. No ordering relationship 
between these
cacheline fetches is required.





Now I suspect that what is going on is that your Root complex
reads descriptors out of order, so the second descriptor is invalid
but the 1st one is valid.




> We would like to know if such a change in the configuration order
> is reasonable and acceptable?

We need to understand the root cause and how robust the fix is
before answering this.


> Thanks.
> 
> Signed-off-by: ni.liqiang <ni_liqi...@126.com>
> Reviewed-by: jin.qi <jin...@zte.com.cn>
> Tested-by: jin.qi <jin...@zte.com.cn>
> Cc: ni.liqiang <ni.liqi...@zte.com.cn>
> ---
>  drivers/virtio/virtio_ring.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..bea2c2fb084e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct 
> virtqueue *_vq,
>                       flags = cpu_to_le16(vq->packed.avail_used_flags |
>                                   (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>                                   (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> -                     if (i == head)
> -                             head_flags = flags;
> -                     else
> -                             desc[i].flags = flags;
>  
>                       desc[i].addr = cpu_to_le64(addr);
>                       desc[i].len = cpu_to_le32(sg->length);
>                       desc[i].id = cpu_to_le16(id);
>  
> +                     if (i == head)
> +                             head_flags = flags;
> +                     else
> +                             desc[i].flags = flags;
> +
>                       if (unlikely(vq->use_dma_api)) {
>                               vq->packed.desc_extra[curr].addr = addr;
>                               vq->packed.desc_extra[curr].len = sg->length;
> -- 
> 2.34.1


Reply via email to