Hi,

On 7/31/25 7:22 PM, Eugenio Perez Martin wrote:
On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambi...@gmail.com> wrote:
I think I have finally found the reason behind this issue.

The order in which "add_packed" and "get_buf_packed" are performed in the
nested guest kernel (L2 kernel) and QEMU are a little different. Due to
this, the values in free_head and svq->desc_next[] differ and the guest
crashes at some point. More below.

On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambi...@gmail.com> wrote:
I think there's something off in the way "free_head", "last_used_idx" and
"desc_next" values are calculated in vhost_svq_get_buf_packed() [1].

In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
them in order till id 8. At this point it received id 7 again for some reason
and then crashed.

L2:

[ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
[ 1641.130621] (notify) output.0 -> function will return true
[ 1641.132022] output.0 -> id: 0
[ 1739.502358] input.0 -> id: 0
[ 1739.503003] input.0 -> id: 1
[ 1739.562024] input.0 -> id: 2
[ 1739.578682] input.0 -> id: 3
[ 1739.661913] input.0 -> id: 4
[ 1739.828796] input.0 -> id: 5
[ 1739.829789] input.0 -> id: 6
[ 1740.078757] input.0 -> id: 7
[ 1740.079749] input.0 -> id: 8
[ 1740.080382] input.0 -> id: 7    <----Received 7 again
[ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!

QEMU logs (vhost_svq_get_buf_packed):
------
size              : svq->vring.num
len               : svq->vring_packed.vring.desc[last_used].len
id                : svq->vring_packed.vring.desc[last_used].id
num               : svq->desc_state[id].ndescs
last_used_chain   : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
free_head         : svq->free_head
last_used         : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + 
num
used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
------
size: 256, len: 82, id: 7, vq idx: 0
id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 8, vq idx: 0
id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 98, id: 9, vq idx: 0
id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 10, vq idx: 0
id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0

I have a few more ideas of what to do. I'll let you know if I find something
else.

I cannot find anything just by inspection. What about printing all the
desc_state and all desc_next to check for incoherencies in each
svq_add and get_buf?
In this test, all 256 descriptors were filled in the RX vq.

In the TX queue, L2 kernel would add one descriptor at a time and notify
QEMU. QEMU would then register it in its SVQ and mark it as "available".
After processing the descriptor, QEMU would mark it as "used" and flush it
back to L2. L2, in turn, would mark this descriptor as "used". After this
process, L2 would add the next descriptor in the TX vq while reusing this
ID. This was observed from idx 0 till idx 7.

L2's debug logs:

[   18.379112] (use_indirect?) output.0 -> verdict: 0                <----- 
Begin adding descriptor in idx 6
[   18.387134] (add_packed) output.0 -> idx: 6
[   18.389897] (add_packed) output.0 -> id: 0
[   18.392290] (add_packed) output.0 -> len: 74
[   18.394606] (add_packed) output.0 -> addr: 5012315726
[   18.397043] (add_packed) output.0 -> next id: 1
[   18.399861] Entering prepare_packed: output.0
[   18.402478] (prepare_packed) output.0 -> needs_kick: 1
[   18.404998] (notify) output.0 -> function will return true        <----- 
Notify QEMU
[   18.406349] output.0 -> id: 0, idx: 6                             <----- 
Mark ID 0 in idx 6 as used
[   18.409482] output.0 -> old free_head: 1, new free_head: 0        <----- ID 
0 can be reused
[   18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7   <----- 
Next slot is idx 7
[   18.921895] (use_indirect?) output.0 -> verdict: 0                <----- 
Begin adding descriptor with ID = 0 in idx 7
[   18.930093] (add_packed) output.0 -> idx: 7
[   18.935715] (add_packed) output.0 -> id: 0
[   18.937609] (add_packed) output.0 -> len: 122
[   18.939614] (add_packed) output.0 -> addr: 4925868038
[   18.941710] (add_packed) output.0 -> next id: 1
[   18.944032] Entering prepare_packed: output.0
[   18.946148] (prepare_packed) output.0 -> needs_kick: 1
[   18.948234] (notify) output.0 -> function will return true        <----- 
Notify QEMU
[   18.949606] output.0 -> id: 0, idx: 7                             <----- 
Mark ID 0 in idx 7 as used
[   18.952756] output.0 -> old free_head: 1, new free_head: 0        <----- ID 
0 can be reused
[   18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8   <----- 
Next slot is idx 8

There was no issue in QEMU till this point.

[   19.177536] (use_indirect?) output.0 -> verdict: 0                <----- 
Begin adding descriptor with ID = 0 in idx 8
[   19.182415] (add_packed) output.0 -> idx: 8
[   19.187257] (add_packed) output.0 -> id: 0
[   19.191355] (add_packed) output.0 -> len: 102
[   19.195131] (add_packed) output.0 -> addr: 4370702342
[   19.199224] (add_packed) output.0 -> next id: 1
[   19.204929] Entering prepare_packed: output.0
[   19.209505] (prepare_packed) output.0 -> needs_kick: 1
[   19.213820] (notify) output.0 -> function will return true       <----- 
Notify QEMU
[   19.218792] (use_indirect?) output.0 -> verdict: 0               <----- Next 
slot is idx 9
[   19.224730] (add_packed) output.0 -> idx: 9
[   19.227067] (add_packed) output.0 -> id: 1                       <----- ID 0 
can't be reused yet, so use ID = 1
[   19.229090] (add_packed) output.0 -> len: 330
[   19.231182] (add_packed) output.0 -> addr: 4311020614
[   19.233302] (add_packed) output.0 -> next id: 2
[   19.235620] Entering prepare_packed: output.0
[   19.237781] (prepare_packed) output.0 -> needs_kick: 1
[   19.239958] (notify) output.0 -> function will return true       <----- 
Notify QEMU
[   19.237780] output.0 -> id: 0, idx: 8                            <----- Mark 
ID 0 in idx 8 as used
[   19.243676] output.0 -> old free_head: 2, new free_head: 0       <----- ID 0 
can now be reused
[   19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9  <----- Next 
slot is idx 9
[   19.247097] output.0 -> id: 1, idx: 9                            <----- Mark 
ID 1 in idx 9 as used
[   19.249612] output.0 -> old free_head: 0, new free_head: 1       <----- ID 1 
can now be reused
[   19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next 
slot is idx 10

ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
before either of them are marked as used.

But in QEMU, the order is slightly different.

num: 1, init_flags: 128                                                        
<----- vhost_svq_add_packed()
idx: 8, id: 0, len: 0, flags: 0, vq idx: 1                                     
<----- Before adding descriptor
idx: 8, id: 0, len: 102, flags: 128, vq idx: 1                                 
<----- After adding descriptor
Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
svq->vring.num: 256                                                            
<----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0                                                               
<----- Mark ID = 0 as used
last_used: 8                                                                   
<----- Processing idx 8
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0                                                                   
<----- Update free_head to 0.
last_used: 9                                                                   
<----- Update last_used to 9.
vq idx: 1                                                                      
<----- End vhost_svq_get_buf_packed()
i: 0                                                                           
<----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1                                                                      
<----- End vhost_svq_flush()
num: 1, init_flags: 128                                                        
<----- vhost_svq_add_packed()
idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1                            
<----- Before adding descriptor
idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1                        
<----- After adding descriptor
Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1      
<----- ID 0 has been reused (versus ID 1 in L2)
svq->vring.num: 256                                                            
<----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0                                                               
<----- Mark ID = 0 as used
last_used: 9                                                                   
<----- Processing idx 9
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0                                                                   
<----- Update free_head to 0.
last_used: 10                                                                  
<----- Update last_used to 10.
vq idx: 1                                                                      
<----- End vhost_svq_get_buf_packed()
i: 0                                                                           
<----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1                                                                      
<----- End vhost_svq_flush()

In QEMU, id 0 is added in idx 8. But it's marked as used before a
descriptor can be added in idx 9. Because of this there's a discrepancy
in the value of free_head and in svq->desc_next.

In the current implementation, the values of ID are generated, maintained
and processed by QEMU instead of reading from the guest's memory. I think
reading the value of ID from the guest memory (similar to reading the
descriptor length from guest memory) should resolve this issue.


Ok you made a good catch here :).

The 1:1 sync is hard to achieve as a single buffer in the guest may
need to be splitted in many buffers in the host.

The alternative would be to ensure that "add_packed" and "get_buf_packed"
are synchronized between the guest and QEMU.

Yes, they're synchronized. When the guest makes an available
descriptor, its head is saved in the VirtQueueElement of the SVQ's
head idx on svq->desc_state.

Reviewing patch 3/7 I see you're actually returning the id of the
first descriptor of the chain in *head, while it should be the id of
the *last* descriptor. It should not be the cause of the failure, as I
don't see any descriptor chain in the log.

Does this mean if the current free_head is 3 and the length of the chain
is 4, then ID 7 should be saved in the descriptor ring? In the current
implementation, *all* descriptors in the chain are being assigned the same
ID (= free_head instead of free_head + length of chain).

To keep the free linked list happy we may need to store the head of the
descriptor chain in the vq too.

Now, why is SVQ id 0 being reused? Sounds like free_list is not
initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
you print the whole list in each iteration?


The free_list initially has the following values:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0

free_head is set to 0 at the beginning. When add_packed() executes for
the first time ID 0 is used and free_head is set to 1. If get_buf_packed()
is run immediately after the "add" operation, free_list[ID] is set to
the current free_head (ID 1 in this case). After this free_head is set to
the used ID (0 in this case).

So, free_list still looks like this:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0

But the free_head is 0. So, ID = 0 is reused again.

When 2 IDs (such as with idx 8 and idx 9 in the guest) are added to the
SVQ without either being used, then free_head is updated as shown:

free_head = 0 -> 1 -> 2

And then marking both IDs as "used" results in:

free_list[0] = current free_head (= 2)
free_head = 0
free_list[1] = current free_head (= 0)
free_head = 1

So free_list looks like this:

index 0 -> 2
index 1 -> 0
index 2 -> 3
index 3 -> 4
...
index 254 -> 255
index 255 -> 0

None of the indices in free_list hold the value 1.

Since at this point free_head is 1, ID 1 is now used very frequently
if add_packed() and get_buf_used() run one at a time in an interleaved
fashion.

Thanks,
Sahil

Reply via email to