On Sun, Nov 27, 2011 at 6:17 PM, Outback Dingo <[email protected]> wrote:
> On Sun, Nov 27, 2011 at 11:52 AM, Otto Solares Cabrera <[email protected]> wrote:
>> On Sat, Nov 26, 2011 at 10:37:33PM -0500, Outback Dingo wrote:
>>> On Sat, Nov 26, 2011 at 10:13 PM, Hartmut Knaack <[email protected]> wrote:
>>> > This patch brings support for kernel version 3.1 to the ar71xx platform.
>>> > It is based on Otto Estuardo Solares Cabrera's linux-3.0 patches, with
>>> > some changes to keep up with recent filename changes in the kernel.
>>> > Minimum kernel version seems to be 3.1.1, otherwise one of the generic
>>> > patches will fail. Successfully tested with kernel 3.1.2 on a WR1043ND.
>>> > Kernel version in the Makefile still needs to be adjusted manually.
>>>
>>> I'll get onto testing these also
>>
>> It works for me on the wrt160nl with Linux-3.1.3. Thx Hartmut!
>
> Also working on WNDR3700v2 and a variety of Ubiquiti gear.... nice....
> Thanks both of you.

My thanks as well, although I haven't had time to do a build yet. If
anyone is interested in byte queue limits, the patches I was attempting
to backport to 3.1 before taking off for the holiday, including a
modified ag71xx driver, are at:

http://huchra.bufferbloat.net/~cero1/bql/

Regrettably they didn't quite compile before I left for the holiday,
and I'm going to have to rebase cerowrt and rebuild (I'm still
grateful!), so I figure (hope!) one of you folk will beat me to getting
BQL working before I get back to the office on Tuesday.

A plug:

Byte queue limits hold great promise for beating bufferbloat, and for
getting tc's shapers and schedulers to work properly again, at least
on Ethernet.

By holding down the amount of outstanding data that the device driver
has queued, byte queue limits give all the QoS and shaping tools that
we know and love a chance to work again. You can retain large hardware
TX rings: as an example, with a 6k byte queue limit you could have 4
large packets in the buffer, or 93 ack packets, and since either takes
about the same amount of time to transmit, this lets you manage the
bandwidth via tools higher in the stack without compromising line-rate
performance...
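
To make that example concrete (my arithmetic, assuming 1514-byte
full-size frames and 64-byte acks on 100 Mbit Ethernet, ignoring
preamble and inter-frame gap):

   4 x 1514 bytes = 6056 bytes -> ~0.48 ms on the wire at 100 Mbit/s
  93 x   64 bytes = 5952 bytes -> ~0.48 ms on the wire at 100 Mbit/s

Either mix drains in about half a millisecond, so the scheduler above
regains control on the same timescale regardless of the packet sizes.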

The current situation is that we often have hardware TX rings of 64
descriptors or more, which translates to roughly 96k in flight, meaning
that (as already demonstrated) with this patch working you can improve
network responsiveness by a factor of at least ten, perhaps as much as
100. (TCP's response to buffering is quadratic, not linear, but there
are other variables, so... a factor of 10 sounds good, doesn't it?)
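
To put a number on "96k in flight" (again my arithmetic, assuming 64
full-size 1514-byte frames on a 100 Mbit/s link):

  64 x 1514 bytes = ~97KB = ~775k bits -> ~7.8 ms queued in the ring

That is nearly 8 ms of head-of-line delay that no qdisc above the
driver can do anything about.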

From Tom Herbert's announcement (there was much feedback on netdev; I
would expect another revision to come):


Changes from last version:
 - Rebase to 3.2
 - Added CONFIG_BQL and CONFIG_DQL
 - Added some cache alignment in struct dql to separate read-only from
   writeable elements, and to split the elements written on transmit
   from those written at transmit completion (suggested by Eric; a
   sketch of this split follows the list).
 - Split out adding xps_queue_release as its own patch.
 - Some minor performance changes: use likely() and unlikely() for some
   conditionals.
 - Cleaned up some "show" functions for bql (pointed out by Ben).
 - Changed netdev_tx_completed_queue to check xoff, check
   availability, and then check xoff again.  This prevents potential
   race conditions with netdev_sent_queue (as Ben pointed out).
 - Did some more testing trying to evaluate overhead of BQL in the
   transmit path.  I see about 1-3% degradation in CPU utilization
   and maximum pps when BQL is enabled.  Any ideas to beat this
   down as much as possible would be appreciated!
 - Added high versus low priority traffic test to results below.
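
The cache alignment item above amounts to a split like the one in this
rough sketch (the field names are illustrative, not necessarily what
the patch uses): fields written by the transmit path share one cache
line, and fields written at completion time sit on their own line, so
the two paths, often running on different CPUs, don't bounce a shared
cache line between them.

  #include <linux/cache.h>

  struct dql {
  	/* Written by the transmit path (netdev_sent_queue) */
  	unsigned int num_queued;	/* total bytes ever queued */
  	unsigned int adj_limit;		/* limit + num_completed, precomputed */

  	/* Written by the completion path (netdev_tx_completed_queue);
  	 * aligned so completions don't dirty the transmit line. */
  	unsigned int limit ____cacheline_aligned_in_smp;
  	unsigned int num_completed;	/* total bytes ever completed */
  };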

----

This patch series implements byte queue limits (bql) for NIC TX queues.

Byte queue limits are a mechanism to limit the size of the transmit
hardware queue on a NIC by number of bytes. The goal of these byte
limits is to reduce latency (HOL blocking) caused by excessive queuing
in hardware (aka bufferbloat) without sacrificing throughput.

Hardware queuing limits are typically specified in terms of a number of
hardware descriptors, each of which has a variable size. The size of
individual queued items can vary over a very wide range; for instance,
with the e1000 NIC the size can range from 64 bytes to 4K (with TSO
enabled). This variability makes it next to impossible to choose a
single queue limit that both prevents starvation and provides the
lowest possible latency.
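
To see why (my arithmetic, using the e1000 numbers above): a fixed
limit of, say, 256 descriptors could pin anywhere from

  256 x 64 bytes = 16KB   up to   256 x 4KB = 1MB

in the hardware queue, a 64x spread. A descriptor count generous enough
to avoid starvation with 64-byte packets is wildly oversized for the
TSO case, and vice versa.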

The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions
to the hardware. The latency between two transmissions can be variable
in a system: it is dependent on interrupt frequency, NAPI polling
latencies, scheduling of the queuing discipline, lock contention, etc.
Therefore we propose that byte queue limits should be dynamic and
change in accordance with the networking stack latencies a system
encounters.  BQL should not need to take the underlying link speed as
input; it should automatically adjust to whatever the speed is (even if
that is itself dynamic).
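
As a way to picture that proposal, here is a minimal sketch of such a
feedback loop in C. To be clear, this is my simplification and NOT the
actual dql algorithm (the real library tracks slack over a hold time
and is considerably more careful); it only shows the shape of the idea:
grow the limit when the device starves, decay it when there is
persistent slack.

  struct dyn_limit {
  	unsigned int limit;	/* current byte limit */
  	unsigned int inflight;	/* bytes queued to hw, not yet completed */
  };

  /* Transmit path: may we hand the hardware another 'bytes' bytes? */
  static int dl_may_queue(const struct dyn_limit *dl, unsigned int bytes)
  {
  	return dl->inflight + bytes <= dl->limit;
  }

  static void dl_queued(struct dyn_limit *dl, unsigned int bytes)
  {
  	dl->inflight += bytes;
  }

  /* Completion path: 'bytes' just completed; 'more_waiting' is true if
   * the stack had packets it was not allowed to queue meanwhile. */
  static void dl_completed(struct dyn_limit *dl, unsigned int bytes,
  			 int more_waiting)
  {
  	dl->inflight -= bytes;
  	if (dl->inflight == 0 && more_waiting)
  		dl->limit += bytes;	/* starved: the limit was too low */
  	else if (dl->inflight * 2 < dl->limit)
  		/* lots of slack: decay gently toward the minimum */
  		dl->limit -= (dl->limit - dl->inflight * 2) / 8;
  }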

Patches to implement this:
- Dynamic queue limits (dql) library.  This provides the general
  queuing algorithm.
- netdev changes that use dql to support byte queue limits.
- Support in drivers for byte queue limits (a sketch of the driver-side
  hooks follows below).
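
For a driver author the hookup is small. Here is a rough sketch of the
driver-side support, using netdev_sent_queue and a netdev_completed_queue
call that I assume mirrors the netdev_tx_completed_queue named above for
the single-queue case ("foo" is a hypothetical driver, and the exact
signatures are my guess, so check the patches for the real API):

  #include <linux/netdevice.h>

  static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
  				  struct net_device *dev)
  {
  	/* ... post skb to the hardware TX ring ... */

  	/* BQL: account the bytes handed to the hardware */
  	netdev_sent_queue(dev, skb->len);
  	return NETDEV_TX_OK;
  }

  static void foo_tx_complete(struct net_device *dev)
  {
  	unsigned int pkts = 0, bytes = 0;

  	/* ... reclaim finished descriptors, summing pkts and bytes ... */

  	/* BQL: recompute the limit; may restart a stopped queue */
  	netdev_completed_queue(dev, pkts, bytes);
  }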

The effects of BQL are demonstrated in the benchmark results below.

--- High priority versus low priority traffic:

In this test, 100 netperf TCP_STREAM instances were started to saturate
the link.  A single netperf TCP_RR instance was run with high priority
set.  The queuing discipline is pfifo_fast, the NIC is e1000 with the
TX ring size set to 1024.  The tps for the high priority RR is listed.

No BQL, tso on:  3000-3200K bytes in queue, 36 tps
BQL, tso on:     156-194K bytes in queue, 535 tps
No BQL, tso off: 453-454K bytes in queue, 234 tps
BQL, tso off:    66K bytes in queue, 914 tps

---  Various RR sizes

These tests were done running 200 streams of netperf RR tests.  The
results demonstrate the reduction in queuing and also illustrate the
overhead due to BQL (at small RR sizes).

140000 rr size
BQL:    80-215K bytes in queue, 856 tps, 3.26% cpu
No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

14000 rr size
BQL:    25-55K bytes in queue, 8500 tps
No BQL: 1500-1622K bytes in queue, 8523 tps, 4.53% cpu

1400 rr size
BQL:    20-38K bytes in queue, 86582 tps, 7.38% cpu
No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

140 rr size
BQL:    1-10K bytes in queue, 320540 tps, 34.6% cpu
No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

1 rr size
BQL:    0-3K bytes in queue, 338811 tps, 41.41% cpu
No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu

So the amount of queuing in the NIC can be reduced by 90% or more.
Accordingly, the latency for high priority packets in the presence
of low priority bulk throughput traffic can also be reduced by 90% or
more.

Since BQL accounting is in the transmit path for every packet, and the
function to recompute the byte limit is run once per transmit
completion, there will be some overhead in using BQL.  So far, I've
seen the overhead to be in the range of 1-3% in CPU utilization and
maximum pps.




-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
_______________________________________________
openwrt-devel mailing list
[email protected]
https://lists.openwrt.org/mailman/listinfo/openwrt-devel
