On 2 Sep, 2014, at 1:14 am, Aaron Wood wrote:
>> For the purposes of shaping, the CPU shouldn't need to touch the majority of
>> the payload - only the headers, which are relatively small. The bulk of the
>> payload should DMA from one NIC to RAM, then DMA back out of RAM to the
>> other NIC. It has to do that anyway to route them, and without shaping
>> there'd be more of them to handle. The difference might be in the data
>> structures used by the shaper itself, but I think those are also reasonably
>> compact. It doesn't even have to touch userspace, since it's not acting as
>> the endpoint as my PowerBook was during my tests.
>
> In an ideal case, yes. But is that how this gets managed? (I have no idea,
> I'm certainly not a kernel developer).

It would be monumentally stupid to integrate two GigE MACs onto an SoC, and
then to call it a "network processor", without adequate DMA support. I don't
think Atheros are that stupid.

Here's a more detailed datasheet:

http://pdf.datasheetarchive.com/indexerfiles/Datasheets-SW6/DSASW00118777.pdf

"Another memory factor is the ability to support multiple I/O operations in
parallel via the WNPU's various ports. The on-chip SRAM in AR7100 WNPUs has 5
ports that enable simultaneous access to and from five sources: the two gigabit
Ethernet ports, the PCI port, the USB 2.0 port and the MIPS processor."

It's a reasonable question, however, whether the driver uses that support
properly. Mainline Linux kernel code seems to support the SoC but not the
Ethernet; if it were just a minor variant of some other Atheros hardware, I'd
have expected to see it integrated into one of the existing drivers. Or maybe
it is, and my greps just aren't showing it.

At minimum, however, there are MMIO ranges reported for each MAC during
OpenWRT's boot sequence. That's where the ring buffers are. The most the CPU
has to do is read each packet from RAM and write it into those buffers, or vice
versa for receive - I think that's what my PowerBook has to do. Ideally, a
bog-standard DMA engine would take over that simple duty. Either way, that's
something that has to happen whether it's shaped or not, so it's unlikely to be
our problem.

The same goes for the wireless MACs, incidentally. These are standard ath9k
mini-PCI cards, and the drivers *are* in mainline. There shouldn't be any
surprises with them.

> If the packet data is getting moved about from buffer to buffer (for instance
> to do the htb calculations?) could that substantially change the processing
> load?

The qdiscs only deal with packet and socket headers, not the full packet data.
Even then, they largely pass pointers around, inserting the headers into linked
lists rather than copying them into arrays. I believe a lot of attention has
been directed at cache-friendliness in this area, and the MIPS caches are of
conventional type.

>> Which brings me back to the timers, and other items of black magic.
>
> Which would point to under-utilizing the processor core, while still having
> high load? (I'm not seeing that, I'm curious if that would be the case).

It probably wouldn't manifest as high system load. Rather, poor timer
resolution or latency would show up as excessive delays between packets, during
which the CPU is idle. The packet egress times may turn out to be quantised -
that would be a smoking gun, if detectable.
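
One rough way to look for that quantisation, assuming egress timestamps have
already been captured (for example from a packet capture on the far side of the
router) and converted to integer microseconds. The function name, the 10%
tolerance and the candidate tick length are illustrative assumptions, not
anything taken from the kernel:

```python
def quantisation_score(timestamps_us, tick_us):
    """Fraction of inter-packet gaps that sit close to a multiple of tick_us.

    timestamps_us: packet egress times in integer microseconds, ascending.
    tick_us: candidate timer period to test against (an assumption to vary).
    A gap of roughly zero counts as a multiple too, since a tick-driven
    shaper would release several packets back to back on each tick.
    """
    gaps = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    if not gaps:
        return 0.0
    tolerance = 0.1 * tick_us  # arbitrary 10% window either side of a multiple
    near = 0
    for gap in gaps:
        remainder = gap % tick_us
        distance = min(remainder, tick_us - remainder)
        if distance <= tolerance:
            near += 1
    return near / len(gaps)
```

A score near 1.0 for a plausible tick (say 1000 for a 1ms timer) would hint at
timer-driven release. Gaps with no relation to the tick should score around 0.2
with this tolerance, so that is the baseline to compare against.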
>> Incidentally, transfer speed benchmarks involving wireless will certainly be
>> limited by the wireless link. I assume that's not a factor here.
>
> That's the usual suspicion. But these are RF-chamber, short-range lab setups
> where the radios are running at full speed in perfect environments...

Sure. But even turbocharged 'n' gear tops out at 450Mbps signalling, and much
less than that is available even theoretically for TCP/IP throughput. My point
is that you're probably not running *your* tests over wireless.

> What this makes me realize is that I should go instrument the cpu stats with
> each of the various operating modes:
>
> * no shaping, anywhere
> * egress shaping
> * egress and ingress shaping at various limited levels:
>   * 10Mbps
>   * 20Mbps
>   * 50Mbps
>   * 100Mbps

Smaller increments at the high end of the range may prove to be useful. I
would expect the CPU usage to climb nonlinearly (busy-waiting) if there's a
bottleneck in a peripheral device, such as the PCI bus. The way the kernel
classifies that usage may also be revealing.
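
As a sketch of how that classification could be collected on the router,
assuming a Linux /proc/stat with the usual aggregate "cpu" line: take one
sample before and one after each test run and compare the deltas. The helper
names here are made up for illustration.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))  # older kernels may report fewer fields

def usage_breakdown(before, after):
    """Percentage of elapsed CPU time spent in each class between two samples."""
    deltas = {k: after[k] - before[k] for k in before if k in after}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}

def read_cpu(path="/proc/stat"):
    """Read one sample from the running system."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate cpu line found")

# Typical use around a test run:
#   before = read_cpu(); ...run the transfer...; after = read_cpu()
#   print(usage_breakdown(before, after))
```

On a box that is only forwarding, most of the packet work would normally be
expected under softirq rather than user or system, so a shift between idle,
softirq and iowait across the shaping modes is the thing to watch.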
> Heck, what about running HTB simply from a 1ms timer instead of from a data
> driven timer?

That might be what's already happening. We have to figure that out before we
can work out a solution.
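
To illustrate the difference, here is a deliberately simplified model. It is
not how HTB actually schedules, just a backlog of equal-sized packets released
at a fixed rate. It compares a data-driven release with one that can only act
on a periodic tick. All names and parameters are assumptions for the sketch.

```python
from collections import Counter

def departures(n_packets, packet_bytes, rate_bps, tick_us=None):
    """Departure times in microseconds for a backlog of equal-sized packets.

    Without tick_us, each packet leaves as soon as the rate allows.
    With tick_us, it leaves on the first tick at or after that moment.
    """
    times = []
    for i in range(n_packets):
        # Time at which the shaper rate would allow packet i to go.
        eligible_us = i * packet_bytes * 8 * 1_000_000 // rate_bps
        if tick_us is None:
            times.append(eligible_us)
        else:
            # Round up to the next tick boundary (integer ceiling).
            times.append(-(-eligible_us // tick_us) * tick_us)
    return times

def burst_sizes(times):
    """Number of packets released at each distinct departure time."""
    return sorted(Counter(times).items())
```

With 1500-byte packets at 100Mbps the data-driven case spaces packets 120
microseconds apart, while a 1000 microsecond tick releases them in bursts of
about eight. That is the kind of quantised egress pattern a coarse timer would
produce.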
- Jonathan Morton