On 8 Aug 2019, at 17:38, Ilya Maximets wrote:
<SNIP>
I see a rather high number of afxdp_cq_skip, which, to my knowledge, should never happen?
I tried to investigate this previously, but didn't find anything suspicious.
So, to my knowledge, this should never happen either.
However, I only looked at the code without actually running it, because I had no HW available for testing.
While investigating and stress-testing virtual ports I found a few issues with missing locking inside the kernel, so I don't trust the kernel part of the XDP implementation. I suspect there are other bugs in kernel/libbpf that can only be reproduced in driver mode.
This never happens for virtual ports in SKB mode, so I have never seen this coverage counter being non-zero.
I did some quick debugging, as something else has come up that needs my attention :)
Once I'm in the faulty state and send a single packet, causing afxdp_complete_tx() to be called, it tells me 2048 descriptors are ready, which is XSK_RING_PROD__DEFAULT_NUM_DESCS. So I guess there might be some ring-management bug. Maybe the consumer and producer are equal, meaning 0 buffers, but it returns the max? I did not look at the kernel code, so this is just a wild guess :)
(gdb) p tx_done
$3 = 2048
(gdb) p umem->cq
$4 = {cached_prod = 3830466864, cached_cons = 3578066899, mask = 2047,
  size = 2048, producer = 0x7f08486b8000, consumer = 0x7f08486b8040,
  ring = 0x7f08486b8080}
Thanks for debugging!
xsk_ring_cons__peek() just returns the difference between cached_prod and cached_cons, but these values are too far apart:

3830466864 - 3578066899 = 252399965

Since this value is greater than the requested number, it returns the requested number (2048). So the ring is broken, or at least its 'cached' part is. It would be good to look at the *consumer and *producer values to verify the state of the actual ring.
I’ll try to find some more time next week to debug further.
William, I noticed your email on xdp-newbies where you mention this problem of getting the wrong pointers. Did you ever follow up, or do any further troubleshooting on the above?
$ ovs-appctl coverage/show | grep xdp
afxdp_cq_empty      0.0/sec      339.600/sec        5.6606/sec  total: 20378
afxdp_tx_full       0.0/sec       29.967/sec        0.4994/sec  total: 1798
afxdp_cq_skip       0.0/sec 61884770.167/sec  1174238.3644/sec  total: 4227258112
You mentioned seeing this high number in your v15 change notes; did you do any research into why?
Cheers,
Eelco
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev