I've been experiencing some issues with the DPAA Ethernet driver,
specifically related to frame transmission. Hopefully you can point
me in the right direction.
TLDR: Attempting to transmit faster than a few frames per second causes
the TX FQ CGR to enter into the congested state and remain there forever,
even after transmission stops.
The hardware is a T2080RDB, running from the tip of net-next, using
the standard t2080rdb device tree and corenet64_smp_defconfig kernel
config. No changes were made to any of the files. The issue occurs
with 4.16.1 stable as well. In fact, the only time I've been able
to achieve reliable frame transmission was with the SDK 4.1 kernel.
For my tests, I'm running iperf3 both with and without the -R
option (send/receive). When using a USB Ethernet adapter, there
are no issues.
The issue is that it seems like the TX frame queues are getting
"stuck" when attempting to transmit at rates greater than a few frames
per second. Ping works fine, but it seems like anything that could
potentially cause multiple TX frames to be enqueued causes issues.
If I run iperf3 in reverse mode (with the T2080RDB receiving), then
I can achieve ~940 Mbps, but this is also somewhat unreliable.
If I run it with the T2080RDB transmitting, the test will never
complete. Sometimes it starts transmitting for a few seconds then stops,
and other times it never even starts. This also seems to force the
interface into a bad state.
The ethtool stats show that the interface has entered
congestion a few times, and that it's currently congested. The fact
that it's currently congested even after stopping transmission
indicates that the FQ somehow stopped being drained. I've also
noticed that whenever this issue occurs, the TX confirmation
counters are always less than the TX packet counters.
When it gets into this state, I can see that the memory usage is
climbing, up until about the point of where the CGR threshold
is (about 100 MB).
Any idea what could prevent the TX FQ from being drained? My first
guess was flow control, but it's completely disabled.
I tried messing with the egress congestion threshold, workqueue
assignments, etc., but nothing seemed to have any effect.
If you need any more information or want me to run any tests,
please let me know.
Jacob S. Moroni