On 8 Aug 2019, at 17:38, Ilya Maximets wrote:

<SNIP>

I see a rather high number for afxdp_cq_skip, which to my knowledge should never happen?

I tried to investigate this previously, but didn't find anything suspicious.
So, to my knowledge, this should never happen either.
However, I only looked at the code without actually running it, because I had no
HW available for testing.

While investigating and stress-testing virtual ports I found a few issues with
missing locking inside the kernel, so I have no trust in the kernel part of the
XDP implementation. I suspect there are other bugs in kernel/libbpf that can
only be reproduced with driver mode.

This never happens for virtual ports with SKB mode, so I have never seen this
coverage counter go non-zero.

I only did some quick debugging, as something else has come up that needs my attention :)

But once I’m in the faulty state and send a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess there might be some ring management bug. Maybe the consumer and producer are equal, meaning 0 buffers, but it returns the max instead? I did not look at the kernel code, so this is just a wild guess :)

(gdb) p tx_done
$3 = 2048

(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047, size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040, ring = 0x7f08486b8080}

Thanks for debugging!

xsk_ring_cons__peek() just returns the difference between cached_prod
and cached_cons, but these values are way out of sync:

3830466864 - 3578066899 = 252399965

Since this difference is larger than the number requested, it returns the
requested number (2048).
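
For reference, the consumer-side availability check in libbpf looks roughly
like this (paraphrased from xsk.h and simplified, not verbatim; the real code
uses proper memory barriers around the shared-index load):

/* Context: struct xsk_ring_cons from libbpf's xsk.h, the same structure
 * shown in the gdb dump above. */
static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
{
    __u32 entries = r->cached_prod - r->cached_cons;

    if (entries == 0) {
        /* The shared producer index is only re-read when the cached
         * values claim the ring is empty. */
        r->cached_prod = *r->producer;
        entries = r->cached_prod - r->cached_cons;
    }

    /* Never hand back more than the caller asked for. */
    return (entries > nb) ? nb : entries;
}

xsk_ring_cons__peek() then advances cached_cons by the returned count, so once
the cached indices are this far apart, peek keeps returning full batches
without ever consulting the real producer index.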

So, the ring is broken. At least its 'cached' part is. It would be good
to look at the *consumer and *producer values to verify the state of the
actual ring.
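
For example, in the same gdb session (assuming the umem->cq layout shown
above, where producer and consumer are pointers into the shared ring):

(gdb) p *umem->cq.producer
(gdb) p *umem->cq.consumer

If the real indices look sane while the cached ones are wild, the breakage is
in the local 'cached' bookkeeping rather than in the shared ring itself.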


I’ll try to find some more time next week to debug further.

William, I noticed your email on xdp-newbies where you mention this problem of getting the wrong pointers. Did you ever follow up, or do any further troubleshooting on the above?



$ ovs-appctl coverage/show  | grep xdp
afxdp_cq_empty             0.0/sec   339.600/sec        5.6606/sec   total: 20378
afxdp_tx_full              0.0/sec    29.967/sec        0.4994/sec   total: 1798
afxdp_cq_skip              0.0/sec 61884770.167/sec  1174238.3644/sec   total: 4227258112


You mentioned seeing this high number in your v15 change notes; did you do any research into why?

Cheers,

Eelco

