On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:

>>> But this doesn't really answer the question of why the WNDR has so much 
>>> lower a ceiling with shaping than without.  The G4 is powerful enough that 
>>> the overhead of shaping simply disappears next to the overhead of shoving 
>>> data around.  Even when I turn up the shaping knob to a value quite close 
>>> to the hardware's unshaped capabilities (eg. 400Mbps one-way), most of the 
>>> shapers stick to the requested limit like glue, and even the worst offender 
>>> is within 10%.  I estimate that it's using only about 500 clocks per packet 
>>> *unless* it saturates the PCI bus.
>>> 
>>> It's possible, however, that we're not really looking at a CPU limitation, 
>>> but a timer problem.  The PowerBook is a "proper" desktop computer with 
>>> hardware to match (modulo its age).  If all the shapers now depend on the 
>>> high-resolution timer, how high-resolution is the WNDR's timer?

>> Both good questions worth further exploration.

> Doing some napkin math and some spec reading, I think that the memory bus is 
> a likely factor.  The G4 had a fairly impressive memory bus for the day 
> (64-bit?).  The WNDR3800 appears to be used in an x16 configuration (based on 
> the numbers on the memory parts).  It may have *just* enough bw to push 
> concurrent 3x3 802.11n through the software bridge interface, which 
> short-circuits a lot of processing (IIRC).   
> 
> The typical way I've seen a home router being benchmarked for the "marketing 
> numbers" is to flow tcp data to/from a wifi client to a wired client.  Single 
> socket is used, for a uni-directional stream of data.  So long as they can 
> hit peak rates (peak MCS), it will get marked as good for "up to 900Mbps!!" 
> or whatever they want to say.
> 
> The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the 
> various buffers for fq_codel and htb may stay in L2 on the G4, but there 
> simply isn't room in the AR7161 for that, which puts further pressure on the 
> bus.

I don't think that's it.

First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't 
have the external L3 cache interface, so it only has the 256KB or 512KB 
internal L2 cache (I forget which).  The desktop version (7457A) used external 
cache.  The G4 was considered to be *crippled* by its FSB by the end of its 
run, since it never adopted high-performance signalling techniques, nor moved 
the memory controller on-die; it was quoted that the G5 (970) could move data 
using *single-byte* operations faster than the *peak* throughput of the G4's 
FSB.  The only reason the G5 never made it into a PowerBook was that it 
wasn't battery-friendly in the slightest.

But that makes little difference to your argument - compared to a cheap 
CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even 
if it is already a decade old.

More compelling is that even at 16-bit width, the WNDR's RAM should have more 
bandwidth than my PowerBook's PCI bus.  Standard PCI is 33MHz x 32-bit, and I 
can push a steady 30MB/sec in both directions simultaneously, which corresponds 
in total to about half the PCI bus's theoretical capacity.  (The GEM reports 
66MHz capability, but it shares the bus with an IDE controller which doesn't, 
so I assume it is stuck at 33MHz.)  A 16-bit RAM should be able to match PCI if 
it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM.
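
For anyone who wants to check the napkin math above, here is a rough sketch in Python.  It only computes theoretical peaks (one transfer per clock, no protocol overhead); the function name is made up for illustration, and the 30MB/sec figure is the one I quoted.

```python
# Back-of-envelope bus bandwidth; all results are theoretical peaks in MB/s.

def bus_peak_mb_per_s(clock_mhz: float, width_bits: int) -> float:
    """Peak rate assuming one transfer per clock and no overhead."""
    return clock_mhz * width_bits / 8

# Standard PCI: 33MHz x 32-bit.
pci_peak = bus_peak_mb_per_s(33.0, 32)

# Observed load: 30MB/sec in each direction at the same time.
observed = 30.0 * 2
utilisation = observed / pci_peak

# Hypothetical 16-bit SDRAM at 66MHz, single data rate.
sdram_16bit_66 = bus_peak_mb_per_s(66.0, 16)

print(f"PCI peak:          {pci_peak:.0f} MB/s")
print(f"PCI utilisation:   {utilisation:.0%}")
print(f"16-bit SDRAM @ 66: {sdram_16bit_66:.0f} MB/s")
```

The PCI peak and the 16-bit 66MHz SDRAM peak come out the same, and the observed load lands a little under half the PCI peak.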

The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies 
at least 200MHz unless the integrator was colossally stingy.  Further, a little 
digging suggests that the memory bus should be 32-bit wide (hence two 16-bit 
RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed.  For 
an embedded SoC, that's really not too bad - it should be able to sustain 
1GB/sec, in one direction at a time.
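
A quick sketch of that estimate, in Python.  I don't know for certain whether the 340MHz figure is the bus clock (two transfers per clock with DDR) or the effective data rate, so this computes both readings; the function name is hypothetical and the results are theoretical peaks, not sustained rates.

```python
# Peak DRAM bandwidth for a 32-bit bus at 340MHz, under two readings of "340MHz".

def ddr_peak_gb_per_s(clock_mhz: float, width_bits: int,
                      transfers_per_clock: int = 2) -> float:
    """Theoretical peak in GB/s (10^9 bytes per second)."""
    return clock_mhz * 1e6 * transfers_per_clock * width_bits / 8 / 1e9

# Reading 1: 340MHz is the bus clock, DDR gives two transfers per clock.
as_clock = ddr_peak_gb_per_s(340, 32)

# Reading 2: 340MHz is already the effective transfer rate.
as_data_rate = ddr_peak_gb_per_s(340, 32, transfers_per_clock=1)

print(f"340MHz as clock:     {as_clock:.2f} GB/s peak")
print(f"340MHz as data rate: {as_data_rate:.2f} GB/s peak")
```

Either reading gives a peak above 1GB/sec, so a sustained 1GB/sec is plausible once refresh, bank conflicts and bus turnaround are allowed for.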

So that takes care of the argument for simply moving the payload around.  In 
any case, the WNDR demonstrably *can* cope with the available bandwidth if the 
shaping is turned off.

For the purposes of shaping, the CPU shouldn't need to touch the majority of 
the payload - only the headers, which are relatively small.  The bulk of the 
payload should DMA from one NIC to RAM, then DMA back out of RAM to the other 
NIC.  It has to do that anyway to route them, and without shaping there'd be 
more of them to handle.  The difference might be in the data structures used by 
the shaper itself, but I think those are also reasonably compact.  It doesn't 
even have to touch userspace, since it's not acting as the endpoint as my 
PowerBook was during my tests.

And while the MIPS 24K core is old, it's also been die-shrunk over the 
intervening years, so it runs a lot faster than it originally did.  I very much 
doubt that it's as refined as my G4, but it could probably hold its own 
relative to a comparable ARM SoC such as the Raspberry Pi.  (Unfortunately, the 
latter doesn't have the I/O capacity to do high-speed networking - USB only.)  
Atheros publicity materials indicate that they increased the I-cache to 64KB 
for performance reasons, but saw no need to increase the D-cache at the same 
time.

Which brings me back to the timers, and other items of black magic.
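
To put a number on why timer resolution matters, here is a small Python sketch of the time between packets at a given shaped rate, and the CPU cycle budget per packet.  The 1500-byte packet size is my assumption (a typical Ethernet MTU), the 680MHz clock is inferred from the 340MHz memory bus being half the core speed, and the function names are made up for illustration.

```python
# Inter-packet interval and per-packet cycle budget at a shaped rate.

def packet_interval_us(rate_mbps: float, packet_bytes: int = 1500) -> float:
    """Microseconds between packet departures: bits / (Mbit/s) = us."""
    return packet_bytes * 8 / rate_mbps

def cycles_per_packet(cpu_mhz: float, rate_mbps: float,
                      packet_bytes: int = 1500) -> float:
    """CPU cycles available per packet: MHz * us = cycles."""
    return cpu_mhz * packet_interval_us(rate_mbps, packet_bytes)

# 400Mbps is the shaped rate from the PowerBook tests quoted above.
interval = packet_interval_us(400)
budget = cycles_per_packet(680, 400)

print(f"Interval at 400Mbps: {interval:.1f} us")
print(f"Cycle budget at 680MHz: {budget:.0f} cycles")
```

If the timer's real granularity is much coarser than that interval, the shaper can only release packets in bursts, whatever the CPU could otherwise manage.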

Incidentally, transfer speed benchmarks involving wireless will certainly be 
limited by the wireless link.  I assume that's not a factor here.

 - Jonathan Morton

_______________________________________________
Bloat mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/bloat
