[Cake] Does the latest cake support "tc filter"?

2018-05-16 Thread Fushan Wen
Hello developers,
I've seen a mail on the netdev mailing list saying "other tc
filters supported". So can I use "tc filter" to steer specific
traffic into a specific tin without DSCP marks? That would be helpful when
dealing with ingress traffic, where iptables DSCP marking won't work.
Thanks in advance.
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [Cake] [PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> +   if (tb[TCA_CAKE_AUTORATE]) {
>> +   if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
>> +   q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
>> +   else
>> +   q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
>> +   }
>> +
>> +   if (tb[TCA_CAKE_INGRESS]) {
>> +   if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
>> +   q->rate_flags |= CAKE_FLAG_INGRESS;
>> +   else
>> +   q->rate_flags &= ~CAKE_FLAG_INGRESS;
>> +   }
>> +
>> if (tb[TCA_CAKE_MEMORY])
>> q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
>>
>> @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct 
>> sk_buff *skb)
>> if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
>> goto nla_put_failure;
>>
>> +   if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
>> +   !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
>> +   goto nla_put_failure;
>> +
>> +   if (nla_put_u32(skb, TCA_CAKE_INGRESS,
>> +   !!(q->rate_flags & CAKE_FLAG_INGRESS)))
>> +   goto nla_put_failure;
>> +
>
> Why do you want to dump each bit of the rate_flags separately rather than
> dumping the whole rate_flags as an integer?

Well, these were added one at a time, each as a new option. Isn't that
more or less congruent with how netlink attributes are supposed to be
used?
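
For comparison, the single-attribute alternative would look roughly like
this sketch (TCA_CAKE_RATE_FLAGS is a hypothetical attribute name, not part
of the proposed UAPI); dumping the raw bitfield would also freeze the
internal rate_flags layout as userspace ABI, which the per-option
attributes avoid:

	/* Hypothetical sketch: expose the whole bitfield in one attribute */
	if (nla_put_u32(skb, TCA_CAKE_RATE_FLAGS, q->rate_flags))
		goto nla_put_failure;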

-Toke
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [Cake] [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> When CAKE is deployed on a gateway that also performs NAT (which is a
> common deployment mode), the host fairness mechanism cannot distinguish
> internal hosts from each other, and so fails to work correctly.
>
> To fix this, we add an optional NAT awareness mode, which will query the
> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
> and use that in the flow and host hashing.
>
> When the shaper is enabled and the host is already performing NAT, the cost
> of this lookup is negligible. However, in unlimited mode with no NAT being
> performed, there is a significant CPU cost at higher bandwidths. For this
> reason, the feature is turned off by default.
>
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  net/sched/sch_cake.c |   73 
> ++
>  1 file changed, 73 insertions(+)
>
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 65439b643c92..e1038a7b6686 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -71,6 +71,12 @@
>  #include <net/tcp.h>
>  #include <net/flow_dissector.h>
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +#include <net/netfilter/nf_conntrack_core.h>
> +#include <net/netfilter/nf_conntrack_zones.h>
> +#include <net/netfilter/nf_conntrack.h>
> +#endif
> +
>  #define CAKE_SET_WAYS (8)
>  #define CAKE_MAX_TINS (8)
>  #define CAKE_QUEUES (1024)
> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
> return drop;
>  }
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   const struct nf_conntrack_tuple *tuple;
> +   enum ip_conntrack_info ctinfo;
> +   struct nf_conn *ct;
> +   bool rev = false;
> +
> +   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
> +   return;
> +
> +   ct = nf_ct_get(skb, &ctinfo);
> +   if (ct) {
> +   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
> +   } else {
> +   const struct nf_conntrack_tuple_hash *hash;
> +   struct nf_conntrack_tuple srctuple;
> +
> +   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
> +  NFPROTO_IPV4, dev_net(skb->dev),
> +  &srctuple))
> +   return;
> +
> +   hash = nf_conntrack_find_get(dev_net(skb->dev),
> +&nf_ct_zone_dflt,
> +&srctuple);
> +   if (!hash)
> +   return;
> +
> +   rev = true;
> +   ct = nf_ct_tuplehash_to_ctrack(hash);
> +   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
> +   }
> +
> +   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
> +   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
> +
> +   if (keys->ports.ports) {
> +   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
> +   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
> +   }
> +   if (rev)
> +   nf_ct_put(ct);
> +}
> +#else
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   /* There is nothing we can do here without CONNTRACK */
> +}
> +#endif
> +
>  /* Cake has several subtle multiple bit settings. In these cases you
>   *  would be matching triple isolate mode as well.
>   */
> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const 
> struct sk_buff *skb,
> skb_flow_dissect_flow_keys(skb, &keys,
>FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>
> +   if (flow_mode & CAKE_FLOW_NAT_FLAG)
> +   cake_update_flowkeys(&keys, skb);
> +
> /* flow_hash_from_keys() sorts the addresses by value, so we have
>  * to preserve their order in a separate data structure to treat
>  * src and dst host addresses as independently selectable.
> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct 
> nlattr *opt,
> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
> CAKE_FLOW_MASK);
>
> +   if (tb[TCA_CAKE_NAT]) {
> +   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
> +   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
> +   !!nla_get_u32(tb[TCA_CAKE_NAT]);
> +   }


I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK
is not enabled.


> +
> if (tb[TCA_CAKE_RTT]) {
> q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
>
> @@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
> goto nla_put_failure;
>
> +   if (nla_put_u32(skb, TCA_CAKE_NAT,
> +   !!(q->flow_mode & 

Re: [Cake] [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> When CAKE is deployed on a gateway that also performs NAT (which is a
>> common deployment mode), the host fairness mechanism cannot distinguish
>> internal hosts from each other, and so fails to work correctly.
>>
>> To fix this, we add an optional NAT awareness mode, which will query the
>> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
>> and use that in the flow and host hashing.
>>
>> When the shaper is enabled and the host is already performing NAT, the cost
>> of this lookup is negligible. However, in unlimited mode with no NAT being
>> performed, there is a significant CPU cost at higher bandwidths. For this
>> reason, the feature is turned off by default.
>>
>> Signed-off-by: Toke Høiland-Jørgensen 
>> ---
>>  net/sched/sch_cake.c |   73 
>> ++
>>  1 file changed, 73 insertions(+)
>>
>> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
>> index 65439b643c92..e1038a7b6686 100644
>> --- a/net/sched/sch_cake.c
>> +++ b/net/sched/sch_cake.c
>> @@ -71,6 +71,12 @@
>>  #include <net/tcp.h>
>>  #include <net/flow_dissector.h>
>>
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +#include <net/netfilter/nf_conntrack_core.h>
>> +#include <net/netfilter/nf_conntrack_zones.h>
>> +#include <net/netfilter/nf_conntrack.h>
>> +#endif
>> +
>>  #define CAKE_SET_WAYS (8)
>>  #define CAKE_MAX_TINS (8)
>>  #define CAKE_QUEUES (1024)
>> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
>> return drop;
>>  }
>>
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +const struct sk_buff *skb)
>> +{
>> +   const struct nf_conntrack_tuple *tuple;
>> +   enum ip_conntrack_info ctinfo;
>> +   struct nf_conn *ct;
>> +   bool rev = false;
>> +
>> +   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
>> +   return;
>> +
>> +   ct = nf_ct_get(skb, &ctinfo);
>> +   if (ct) {
>> +   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
>> +   } else {
>> +   const struct nf_conntrack_tuple_hash *hash;
>> +   struct nf_conntrack_tuple srctuple;
>> +
>> +   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
>> +  NFPROTO_IPV4, dev_net(skb->dev),
>> +  &srctuple))
>> +   return;
>> +
>> +   hash = nf_conntrack_find_get(dev_net(skb->dev),
>> +&nf_ct_zone_dflt,
>> +&srctuple);
>> +   if (!hash)
>> +   return;
>> +
>> +   rev = true;
>> +   ct = nf_ct_tuplehash_to_ctrack(hash);
>> +   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
>> +   }
>> +
>> +   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
>> +   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
>> +
>> +   if (keys->ports.ports) {
>> +   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
>> +   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
>> +   }
>> +   if (rev)
>> +   nf_ct_put(ct);
>> +}
>> +#else
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +const struct sk_buff *skb)
>> +{
>> +   /* There is nothing we can do here without CONNTRACK */
>> +}
>> +#endif
>> +
>>  /* Cake has several subtle multiple bit settings. In these cases you
>>   *  would be matching triple isolate mode as well.
>>   */
>> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const 
>> struct sk_buff *skb,
>> skb_flow_dissect_flow_keys(skb, &keys,
>>FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>>
>> +   if (flow_mode & CAKE_FLOW_NAT_FLAG)
>> +   cake_update_flowkeys(&keys, skb);
>> +
>> /* flow_hash_from_keys() sorts the addresses by value, so we have
>>  * to preserve their order in a separate data structure to treat
>>  * src and dst host addresses as independently selectable.
>> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct 
>> nlattr *opt,
>> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
>> CAKE_FLOW_MASK);
>>
>> +   if (tb[TCA_CAKE_NAT]) {
>> +   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
>> +   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
>> +   !!nla_get_u32(tb[TCA_CAKE_NAT]);
>> +   }
>
>
> I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK
> is not enabled.

Good point, will fix :)
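
For illustration, the check could take roughly this shape (a sketch only,
not the actual follow-up patch; the extack-based error message is an
assumption):

	if (tb[TCA_CAKE_NAT]) {
#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
			!!nla_get_u32(tb[TCA_CAKE_NAT]);
#else
		/* Reject the nat option outright when conntrack is
		 * unavailable, as suggested above.
		 */
		NL_SET_ERR_MSG_ATTR(extack, tb[TCA_CAKE_NAT],
				    "No conntrack support in kernel");
		return -EOPNOTSUPP;
#endif
	}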

-Toke
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [Cake] [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> +
>> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg)
>> +{
>> +   return NULL;
>> +}
>> +
>> +static unsigned long cake_find(struct Qdisc *sch, u32 classid)
>> +{
>> +   return 0;
>> +}
>> +
>> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
>> +{
>> +}
>
>
> Thanks for adding support for other TC filters, it is much better
> now!

You're welcome. Turned out not to be that hard :)

> A quick question: why class_ops->dump_stats is still NULL?
>
> It is supposed to dump the stats of each flow. Is there still any
> difficulty in mapping it to a tc class? I thought you figured it out when
> you added the tcf_classify().

On the classify side, I solved the "multiple sets of queues" problem by
using skb->priority to select the tin (diffserv tier) and the classifier
output to select the queue within that tin. This would not work for
dumping stats; some other way of mapping queues to the linear class
space would be needed. And since we are not actually collecting any
per-flow stats that I could print, I thought it wasn't worth coming up
with a half-baked proposal for this just to add an API hook that no one
in the existing CAKE user base has ever asked for...
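
For illustration, the tin-selection half of that mapping looks roughly like
this (a sketch based on the description above, not the literal patch code):

	/* A filter verdict whose major number matches the qdisc handle
	 * selects the tin via the minor number; otherwise fall back to
	 * diffserv-based tin selection.
	 */
	if (TC_H_MAJ(skb->priority) == sch->handle &&
	    TC_H_MIN(skb->priority) > 0 &&
	    TC_H_MIN(skb->priority) <= q->tin_cnt)
		tin = q->tin_order[TC_H_MIN(skb->priority) - 1];
	else
		tin = cake_diffserv(sch, skb);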

-Toke
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [Cake] [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> +
> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg)
> +{
> +   return NULL;
> +}
> +
> +static unsigned long cake_find(struct Qdisc *sch, u32 classid)
> +{
> +   return 0;
> +}
> +
> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +}


Thanks for adding support for other TC filters, it is much better now!

A quick question: why class_ops->dump_stats is still NULL?

It is supposed to dump the stats of each flow. Is there still any difficulty
in mapping it to a tc class? I thought you figured it out when you added the
tcf_classify().
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


[Cake] [PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Toke Høiland-Jørgensen
The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the shaper
to also account dropped packets to the shaped rate, as these have already
traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum queue
occupancy level that will disable the AQM by the number of active bulk
flows. The rationale for this is that retransmits are more expensive in
ingress mode, since dropped packets have to traverse the bottleneck again
when they are retransmitted; thus, being more lenient and keeping a minimum
number of packets queued will improve throughput in cases where the number
of active flows is so large that they saturate the bottleneck even at
their minimum window size. For example, with ten active bulk flows, the
AQM will tolerate a standing queue of up to twenty packets before engaging.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. Short-term rate spikes are smoothed out by an EWMA
filter.
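
The EWMA in question is a simple shift-based filter; roughly (a sketch of
the cake_ewma() helper invoked in the diff below, with asymmetric shifts so
the average tracks rising samples quickly and decays slowly):

	/* avg moves towards sample by ~1/2^shift per update; shift 2
	 * reacts quickly, shift 8 smooths out short-term bursts.
	 */
	static u64 cake_ewma(u64 avg, u64 sample, u32 shift)
	{
		avg -= avg >> shift;
		avg += sample >> shift;
		return avg;
	}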

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   85 --
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 422cfccbf37f..d515f18f8460 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -433,7 +433,8 @@ static bool cobalt_queue_empty(struct cobalt_vars *vars,
 static bool cobalt_should_drop(struct cobalt_vars *vars,
   struct cobalt_params *p,
   ktime_t now,
-  struct sk_buff *skb)
+  struct sk_buff *skb,
+  u32 bulk_flows)
 {
bool next_due, over_target, drop = false;
ktime_t schedule;
@@ -457,6 +458,7 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
sojourn = ktime_to_ns(ktime_sub(now, cobalt_get_enqueue_time(skb)));
schedule = ktime_sub(now, vars->drop_next);
over_target = sojourn > p->target &&
+ sojourn > p->mtu_time * bulk_flows * 2 &&
  sojourn > p->mtu_time * 4;
next_due = vars->count && schedule >= 0;
 
@@ -910,6 +912,9 @@ static unsigned int cake_drop(struct Qdisc *sch, struct 
sk_buff **to_free)
b->tin_dropped++;
sch->qstats.drops++;
 
+   if (q->rate_flags & CAKE_FLAG_INGRESS)
+   cake_advance_shaper(q, b, skb, now, true);
+
__qdisc_drop(skb, to_free);
sch->q.qlen--;
 
@@ -986,8 +991,46 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
cake_heapify_up(q, b->overflow_idx[idx]);
 
/* incoming bandwidth capacity estimate */
-   q->avg_window_bytes = 0;
-   q->last_packet_time = now;
+   if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+   u64 packet_interval = \
+   ktime_to_ns(ktime_sub(now, q->last_packet_time));
+
+   if (packet_interval > NSEC_PER_SEC)
+   packet_interval = NSEC_PER_SEC;
+
+   /* filter out short-term bursts, eg. wifi aggregation */
+   q->avg_packet_interval = \
+   cake_ewma(q->avg_packet_interval,
+ packet_interval,
+ (packet_interval > q->avg_packet_interval ?
+ 2 : 8));
+
+   q->last_packet_time = now;
+
+   if (packet_interval > q->avg_packet_interval) {
+   u64 window_interval = \
+   ktime_to_ns(ktime_sub(now,
+ q->avg_window_begin));
+   u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+   do_div(b, window_interval);
+   q->avg_peak_bandwidth =
+   cake_ewma(q->avg_peak_bandwidth, b,
+ b > q->avg_peak_bandwidth ? 2 : 8);
+   q->avg_window_bytes = 0;
+   q->avg_window_begin = now;
+
+   if (ktime_after(now,
+   ktime_add_ms(q->last_reconfig_time,
+250))) {
+   q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+   cake_reconfigure(sch);
+   }
+   }
+   } else {
+   q->avg_window_bytes = 0;
+   q->last_packet_time = now;
+   }
 
/* flowchain */
if (!flow->set || flow->set == CAKE_SET_DECAYING) {
@@ -1246,14 +1289,26 @@ static struct sk_buff 

[Cake] [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8-way set-associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency
  variation.

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
International Symposium on Local and Metropolitan Area Networks (LANMAN).

This patch adds the base shaper and packet scheduler, while subsequent
commits add the optional (configurable) features. The full userspace API
and most data structures are included in this commit, but options not
understood in the base version will be ignored.

Various versions have been baking as an out-of-tree build for kernel
versions going back to 3.10, as the embedded router world has been running
a few years behind mainline Linux. A stable version has been generally
available in lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 
100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 500b
 capacity estimate: 100Mbit
 min/max network layer size:65535 /   0
 min/max overhead-adjusted size:65535 /   0
 average network hdr offset:0

                  Bulk   Best Effort         Voice
  thresh      6250Kbit       100Mbit        25Mbit
  target         5.0ms         5.0ms         5.0ms
  interval     100.0ms       100.0ms       100.0ms
  pk_delay         0us           0us           0us
  av_delay         0us           0us           0us
  sp_delay         0us           0us           0us
  pkts               0             0             0
  bytes              0             0             0
  way_inds           0             0             0
  way_miss           0             0             0
  way_cols           0             0             0
  drops              0             0             0
  marks              0             0             0
  ack_drop           0             0             0
  sp_flows           0             0             0
  bk_flows           0             0             0
  un_flows           0             0             0
  max_len            0             0             0
  quantum          300          1514           762

Tested-by: Pete Heist 
Tested-by: Georgios Amanakis 
Signed-off-by: Dave Taht 
Signed-off-by: Toke Høiland-Jørgensen 
---
 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig  |   11 
 net/sched/Makefile |1 
 net/sched/sch_cake.c   | 1739 
 4 files changed, 1856 insertions(+)
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..883e84f008d7 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,109 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+   TCA_CAKE_UNSPEC,
+   TCA_CAKE_BASE_RATE64,
+   TCA_CAKE_DIFFSERV_MODE,
+   TCA_CAKE_ATM,
+   TCA_CAKE_FLOW_MODE,
+   TCA_CAKE_OVERHEAD,
+   TCA_CAKE_RTT,
+   TCA_CAKE_TARGET,
+   TCA_CAKE_AUTORATE,
+   TCA_CAKE_MEMORY,
+   TCA_CAKE_NAT,
+   TCA_CAKE_RAW,
+   TCA_CAKE_WASH,
+   TCA_CAKE_MPU,
+ 

[Cake] [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Toke Høiland-Jørgensen
When CAKE is deployed on a gateway that also performs NAT (which is a
common deployment mode), the host fairness mechanism cannot distinguish
internal hosts from each other, and so fails to work correctly.

To fix this, we add an optional NAT awareness mode, which will query the
kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
and use that in the flow and host hashing.

When the shaper is enabled and the host is already performing NAT, the cost
of this lookup is negligible. However, in unlimited mode with no NAT being
performed, there is a significant CPU cost at higher bandwidths. For this
reason, the feature is turned off by default.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   73 ++
 1 file changed, 73 insertions(+)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 65439b643c92..e1038a7b6686 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -71,6 +71,12 @@
 #include <net/tcp.h>
 #include <net/flow_dissector.h>
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
 #define CAKE_SET_WAYS (8)
 #define CAKE_MAX_TINS (8)
 #define CAKE_QUEUES (1024)
@@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
return drop;
 }
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+
+static void cake_update_flowkeys(struct flow_keys *keys,
+const struct sk_buff *skb)
+{
+   const struct nf_conntrack_tuple *tuple;
+   enum ip_conntrack_info ctinfo;
+   struct nf_conn *ct;
+   bool rev = false;
+
+   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+   return;
+
+   ct = nf_ct_get(skb, &ctinfo);
+   if (ct) {
+   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+   } else {
+   const struct nf_conntrack_tuple_hash *hash;
+   struct nf_conntrack_tuple srctuple;
+
+   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+  NFPROTO_IPV4, dev_net(skb->dev),
+  &srctuple))
+   return;
+
+   hash = nf_conntrack_find_get(dev_net(skb->dev),
+&nf_ct_zone_dflt,
+&srctuple);
+   if (!hash)
+   return;
+
+   rev = true;
+   ct = nf_ct_tuplehash_to_ctrack(hash);
+   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+   }
+
+   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+   if (keys->ports.ports) {
+   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+   }
+   if (rev)
+   nf_ct_put(ct);
+}
+#else
+static void cake_update_flowkeys(struct flow_keys *keys,
+const struct sk_buff *skb)
+{
+   /* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
 /* Cake has several subtle multiple bit settings. In these cases you
  *  would be matching triple isolate mode as well.
  */
@@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct 
sk_buff *skb,
skb_flow_dissect_flow_keys(skb, &keys,
   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
+   if (flow_mode & CAKE_FLOW_NAT_FLAG)
+   cake_update_flowkeys(&keys, skb);
+
/* flow_hash_from_keys() sorts the addresses by value, so we have
 * to preserve their order in a separate data structure to treat
 * src and dst host addresses as independently selectable.
@@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct nlattr 
*opt,
q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
CAKE_FLOW_MASK);
 
+   if (tb[TCA_CAKE_NAT]) {
+   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+   !!nla_get_u32(tb[TCA_CAKE_NAT]);
+   }
+
if (tb[TCA_CAKE_RTT]) {
q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
 
@@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
*skb)
if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
goto nla_put_failure;
 
+   if (nla_put_u32(skb, TCA_CAKE_NAT,
+   !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+   goto nla_put_failure;
+
return nla_nest_end(skb, opts);
 
 nla_put_failure:

___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


[Cake] [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-16 Thread Toke Høiland-Jørgensen
The ACK filter is an optional feature of CAKE which is designed to improve
performance on links with very asymmetrical rate limits. On such links
(which are unfortunately quite prevalent, especially for DSL and cable
subscribers), the downstream throughput can be limited by the number of
ACKs capable of being transmitted in the *upstream* direction.

Filtering ACKs can, in general, have adverse effects on TCP performance
because it interferes with ACK clocking (especially in slow start), and it
reduces the flow's resiliency to ACKs being dropped further along the path.
To alleviate these drawbacks, the ACK filter in CAKE tries its best to
always keep enough ACKs queued to ensure forward progress in the TCP flow
being filtered. It does this by only filtering redundant ACKs. In its
default 'conservative' mode, the filter will always keep at least two
redundant ACKs in the queue, while in 'aggressive' mode, it will filter
down to a single ACK.

The ACK filter works by inspecting the per-flow queue on every packet
enqueue. Starting at the head of the queue, the filter looks for another
eligible packet to drop (so the ACK being dropped is always closer to the
head of the queue than the packet being enqueued). An ACK is eligible only
if it ACKs *fewer* cumulative bytes than the new packet being enqueued.
This prevents duplicate ACKs from being filtered (unless SACK options
are also present), to avoid interfering with retransmission logic. In
aggressive mode, an eligible packet is always dropped, while in
conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
(with no data segments) are considered eligible for dropping, but when an
ACK with data segments is enqueued, this can cause another pure ACK to
become eligible for dropping.

The approach described above ensures that this ACK filter avoids most of
the drawbacks of a naive filtering mechanism that only keeps flow state but
does not inspect the queue. This is the rationale for including the ACK
filter in CAKE itself rather than as a separate module (as a TC filter, for
instance).

Our performance evaluation has shown that on a 30/1 Mbps link with a
bidirectional traffic test (RRUL), turning on the ACK filter on the
upstream link improves downstream throughput by ~20% (both modes) and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for instance,
the ACK filter increases the achievable downstream throughput on a link
with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
Mbps to ~25 Mbps).

Finally, even though we consider the ACK filter to be safer than most, we
do not recommend turning it on everywhere: on more symmetrical link
bandwidths the effect is negligible at best.
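
To make the queue-scan rule concrete, the eligibility test amounts to
roughly the following (an illustrative sketch; the real cake_ack_filter()
in this patch also handles SACK options, TCP flag checks, and the
conservative-mode bookkeeping):

	/* A queued pure ACK is eligible for dropping only if the newly
	 * enqueued packet acknowledges strictly more cumulative data;
	 * genuine duplicate ACKs (equal ack_seq) are never filtered.
	 */
	static bool ack_is_redundant(const struct tcphdr *queued,
				     const struct tcphdr *incoming)
	{
		return after(ntohl(incoming->ack_seq),
			     ntohl(queued->ack_seq));
	}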

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |  260 ++
 1 file changed, 258 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index d515f18f8460..65439b643c92 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -755,6 +755,239 @@ static void flow_queue_add(struct cake_flow *flow, struct 
sk_buff *skb)
skb->next = NULL;
 }
 
+static struct iphdr *cake_get_iphdr(const struct sk_buff *skb,
+   struct ipv6hdr *buf)
+{
+   unsigned int offset = skb_network_offset(skb);
+   struct iphdr *iph;
+
+   iph = skb_header_pointer(skb, offset, sizeof(struct iphdr), buf);
+
+   if (!iph)
+   return NULL;
+
+   if (iph->version == 4 && iph->protocol == IPPROTO_IPV6)
+   return skb_header_pointer(skb, offset + iph->ihl * 4,
+ sizeof(struct ipv6hdr), buf);
+
+   else if (iph->version == 4)
+   return iph;
+
+   else if (iph->version == 6)
+   return skb_header_pointer(skb, offset, sizeof(struct ipv6hdr),
+ buf);
+
+   return NULL;
+}
+
+static struct tcphdr *cake_get_tcphdr(const struct sk_buff *skb,
+ void *buf, unsigned int bufsize)
+{
+   unsigned int offset = skb_network_offset(skb);
+   const struct ipv6hdr *ipv6h;
+   const struct tcphdr *tcph;
+   const struct iphdr *iph;
+   struct ipv6hdr _ipv6h;
+   struct tcphdr _tcph;
+
+   ipv6h = skb_header_pointer(skb, offset, sizeof(_ipv6h), &_ipv6h);
+
+   if (!ipv6h)
+   return NULL;
+
+   if (ipv6h->version == 4) {
+   iph = (struct iphdr *)ipv6h;
+   offset += iph->ihl * 4;
+
+   /* special-case 6in4 tunnelling, as that is a common way to get
+* v6 connectivity in the home
+*/
+   if (iph->protocol == IPPROTO_IPV6) {
+   ipv6h = 

[Cake] [PATCH net-next v12 6/7] sch_cake: Add overhead compensation support to the rate shaper

2018-05-16 Thread Toke Høiland-Jørgensen
This commit adds configurable overhead compensation support to the rate
shaper. With this feature, userspace can configure the actual bottleneck
link overhead and encapsulation mode used, which will be used by the shaper
to calculate the precise duration of each packet on the wire.

This feature is needed because CAKE is often deployed one or two hops
upstream of the actual bottleneck (which can be, e.g., inside a DSL or
cable modem). In this case, the link layer characteristics and overhead
reported by the kernel does not match the actual bottleneck. Being able to
set the actual values in use makes it possible to configure the shaper rate
much closer to the actual bottleneck rate (our experience shows it is
possible to get with 0.1% of the actual physical bottleneck rate), thus
keeping latency low without sacrificing bandwidth.

The overhead compensation has three tunables: A fixed per-packet overhead
size (which, if set, will be accounted from the IP packet header), a
minimum packet size (MPU) and a framing mode supporting either ATM or PTM
framing. We include a set of common keywords in TC to help users configure
the right parameters. If no overhead value is set, the value reported by
the kernel is used.
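
As a worked example of these rules (assuming a configured overhead of 40
bytes): a 1500-byte IP packet becomes 1540 bytes after overhead; ATM
framing turns that into ceil(1540/48) = 33 cells of 53 bytes = 1749 bytes
on the wire, while PTM framing yields 1540 + ceil(1540/64) = 1565 bytes.
In code form (a sketch mirroring the integer arithmetic of
cake_calc_overhead() in the diff below):

	static u32 atm_wire_len(u32 len)
	{
		len += 47;	 /* round up to whole 48-byte cell payloads */
		len /= 48;	 /* number of ATM cells */
		return len * 53; /* each cell is 53 bytes on the wire */
	}

	static u32 ptm_wire_len(u32 len)
	{
		return len + (len + 63) / 64; /* one byte per 64 or part */
	}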

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |  124 ++
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index f0f94d536e51..1ce81d919f73 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -271,6 +271,7 @@ enum {
 
 struct cobalt_skb_cb {
ktime_t enqueue_time;
+   u32 adjusted_len;
 };
 
 static u64 us_to_ns(u64 us)
@@ -1120,6 +1121,88 @@ static u64 cake_ewma(u64 avg, u64 sample, u32 shift)
return avg;
 }
 
+static u32 cake_calc_overhead(struct cake_sched_data *q, u32 len, u32 off)
+{
+   if (q->rate_flags & CAKE_FLAG_OVERHEAD)
+   len -= off;
+
+   if (q->max_netlen < len)
+   q->max_netlen = len;
+   if (q->min_netlen > len)
+   q->min_netlen = len;
+
+   len += q->rate_overhead;
+
+   if (len < q->rate_mpu)
+   len = q->rate_mpu;
+
+   if (q->atm_mode == CAKE_ATM_ATM) {
+   len += 47;
+   len /= 48;
+   len *= 53;
+   } else if (q->atm_mode == CAKE_ATM_PTM) {
+   /* Add one byte per 64 bytes or part thereof.
+* This is conservative and easier to calculate than the
+* precise value.
+*/
+   len += (len + 63) / 64;
+   }
+
+   if (q->max_adjlen < len)
+   q->max_adjlen = len;
+   if (q->min_adjlen > len)
+   q->min_adjlen = len;
+
+   return len;
+}
+
+static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
+{
+   const struct skb_shared_info *shinfo = skb_shinfo(skb);
+   unsigned int hdr_len, last_len = 0;
+   u32 off = skb_network_offset(skb);
+   u32 len = qdisc_pkt_len(skb);
+   u16 segs = 1;
+
+   q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
+
+   if (!shinfo->gso_size)
+   return cake_calc_overhead(q, len, off);
+
+   /* borrowed from qdisc_pkt_len_init() */
+   hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+
+   /* + transport layer */
+   if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
+   SKB_GSO_TCPV6))) {
+   const struct tcphdr *th;
+   struct tcphdr _tcphdr;
+
+   th = skb_header_pointer(skb, skb_transport_offset(skb),
+   sizeof(_tcphdr), &_tcphdr);
+   if (likely(th))
+   hdr_len += __tcp_hdrlen(th);
+   } else {
+   struct udphdr _udphdr;
+
+   if (skb_header_pointer(skb, skb_transport_offset(skb),
+  sizeof(_udphdr), &_udphdr))
+   hdr_len += sizeof(struct udphdr);
+   }
+
+   if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
+   segs = DIV_ROUND_UP(skb->len - hdr_len,
+   shinfo->gso_size);
+   else
+   segs = shinfo->gso_segs;
+
+   len = shinfo->gso_size + hdr_len;
+   last_len = skb->len - shinfo->gso_size * (segs - 1);
+
+   return (cake_calc_overhead(q, len, off) * (segs - 1) +
+   cake_calc_overhead(q, last_len, off));
+}
+
 static void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
 {
struct cake_heap_entry ii = q->overflow_heap[i];
@@ -1197,7 +1280,7 @@ static int cake_advance_shaper(struct cake_sched_data *q,
   struct sk_buff *skb,
   ktime_t now, bool drop)
 {
-   u32 len = qdisc_pkt_len(skb);
+   u32 len = get_cobalt_cb(skb)->adjusted_len;
 
/* charge packet bandwidth to 

[Cake] [PATCH net-next v12 7/7] sch_cake: Conditionally split GSO segments

2018-05-16 Thread Toke Høiland-Jørgensen
At lower bandwidths, the transmission time of a single GSO super-packet can
add an unacceptable amount of latency due to head-of-line (HOL) blocking;
at 1 Mbps, for example, a single 64 KB GSO super-packet occupies the link
for roughly half a second. Furthermore, with a software shaper, any tuning
mechanism employed by the kernel to control the maximum size of GSO
segments is thrown off by the artificial limit on bandwidth. For this
reason, we split GSO super-packets into their individual segments iff the
shaper is active and configured to a bandwidth <= 1 Gbps.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   99 +-
 1 file changed, 73 insertions(+), 26 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 1ce81d919f73..dca276806e9f 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -82,6 +82,7 @@
 #define CAKE_QUEUES (1024)
 #define CAKE_FLOW_MASK 63
 #define CAKE_FLOW_NAT_FLAG 64
+#define CAKE_SPLIT_GSO_THRESHOLD (125000000) /* 1Gbps */
 
 /* struct cobalt_params - contains codel and blue parameters
  * @interval:  codel initial drop rate
@@ -1474,36 +1475,73 @@ static s32 cake_enqueue(struct sk_buff *skb, struct 
Qdisc *sch,
if (unlikely(len > b->max_skblen))
b->max_skblen = len;
 
-   cobalt_set_enqueue_time(skb, now);
-   get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
-   flow_queue_add(flow, skb);
-
-   if (q->ack_filter)
-   ack = cake_ack_filter(q, flow);
+   if (skb_is_gso(skb) && q->rate_flags & CAKE_FLAG_SPLIT_GSO) {
+   struct sk_buff *segs, *nskb;
+   netdev_features_t features = netif_skb_features(skb);
+   unsigned int slen = 0;
+
+   segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+   if (IS_ERR_OR_NULL(segs))
+   return qdisc_drop(skb, sch, to_free);
+
+   while (segs) {
+   nskb = segs->next;
+   segs->next = NULL;
+   qdisc_skb_cb(segs)->pkt_len = segs->len;
+   cobalt_set_enqueue_time(segs, now);
+   get_cobalt_cb(segs)->adjusted_len = cake_overhead(q,
+ segs);
+   flow_queue_add(flow, segs);
+
+   sch->q.qlen++;
+   slen += segs->len;
+   q->buffer_used += segs->truesize;
+   b->packets++;
+   segs = nskb;
+   }
 
-   if (ack) {
-   b->ack_drops++;
-   sch->qstats.drops++;
-   b->bytes += qdisc_pkt_len(ack);
-   len -= qdisc_pkt_len(ack);
-   q->buffer_used += skb->truesize - ack->truesize;
-   if (q->rate_flags & CAKE_FLAG_INGRESS)
-   cake_advance_shaper(q, b, ack, now, true);
+   /* stats */
+   b->bytes+= slen;
+   b->backlogs[idx]+= slen;
+   b->tin_backlog  += slen;
+   sch->qstats.backlog += slen;
+   q->avg_window_bytes += slen;
 
-   qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
-   consume_skb(ack);
+   qdisc_tree_reduce_backlog(sch, 1, len);
+   consume_skb(skb);
} else {
-   sch->q.qlen++;
-   q->buffer_used  += skb->truesize;
-   }
+   /* not splitting */
+   cobalt_set_enqueue_time(skb, now);
+   get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
+   flow_queue_add(flow, skb);
+
+   if (q->ack_filter)
+   ack = cake_ack_filter(q, flow);
+
+   if (ack) {
+   b->ack_drops++;
+   sch->qstats.drops++;
+   b->bytes += qdisc_pkt_len(ack);
+   len -= qdisc_pkt_len(ack);
+   q->buffer_used += skb->truesize - ack->truesize;
+   if (q->rate_flags & CAKE_FLAG_INGRESS)
+   cake_advance_shaper(q, b, ack, now, true);
+
+   qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+   consume_skb(ack);
+   } else {
+   sch->q.qlen++;
+   q->buffer_used  += skb->truesize;
+   }
 
-   /* stats */
-   b->packets++;
-   b->bytes+= len;
-   b->backlogs[idx]+= len;
-   b->tin_backlog  += len;
-   sch->qstats.backlog += len;
-   q->avg_window_bytes += len;
+   /* stats */
+   b->packets++;
+   b->bytes+= len;
+   b->backlogs[idx]+= len;
+   b->tin_backlog  += len;
+   sch->qstats.backlog += len;
+   q->avg_window_bytes += len;
+   }
 
if 

Re: [Cake] [PATCH net-next v11 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
David Miller  writes:

> From: Toke Høiland-Jørgensen 
> Date: Tue, 15 May 2018 17:12:44 +0200
>
>> +typedef u64 cobalt_time_t;
>> +typedef s64 cobalt_tdiff_t;
>  ...
>> +static cobalt_time_t cobalt_get_time(void)
>> +{
>> +return ktime_get_ns();
>> +}
>> +
>> +static u32 cobalt_time_to_us(cobalt_time_t val)
>> +{
>> +do_div(val, NSEC_PER_USEC);
>> +return (u32)val;
>> +}
>
> If fundamentally you are working with ktime_t values, please use that type
> and the associated helpers.
>
> There is a valid argument that using custom typedefs provides documentation
> and an aid to understanding, but I don't think they serve that purpose
> very well here.
>
> So please just use ktime_t throughout instead of this cobalt_time_t
> and cobalt_tdiff_t.  And then use helpers like ktime_to_us() which
> properly optimize for 64-bit vs. 32-bit hosts.

Can do :)

-Toke
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake


Re: [Cake] [PATCH net-next v11 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread David Miller
From: Toke Høiland-Jørgensen 
Date: Tue, 15 May 2018 17:12:44 +0200

> +typedef u64 cobalt_time_t;
> +typedef s64 cobalt_tdiff_t;
 ...
> +static cobalt_time_t cobalt_get_time(void)
> +{
> + return ktime_get_ns();
> +}
> +
> +static u32 cobalt_time_to_us(cobalt_time_t val)
> +{
> + do_div(val, NSEC_PER_USEC);
> + return (u32)val;
> +}

If fundamentally you are working with ktime_t values, please use that type
and the associated helpers.

There is a valid argument that using custom typedefs provides documentation
and an aid to understanding, but I don't think they serve that purpose
very well here.

So please just use ktime_t throughout instead of this cobalt_time_t
and cobalt_tdiff_t.  And then use helpers like ktime_to_us() which
properly optimize for 64-bit vs. 32-bit hosts.
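
For illustration, the suggested conversion amounts to roughly this (a
sketch; the actual change appeared in a later revision of the series):

	/* Use ktime_t and its helpers directly instead of the custom
	 * cobalt_time_t/cobalt_tdiff_t typedefs.
	 */
	static ktime_t cobalt_get_time(void)
	{
		return ktime_get();
	}

	static u32 cobalt_time_to_us(ktime_t t)
	{
		return (u32)ktime_to_us(t);
	}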

Thank you.
___
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake