Ok, the lockup goes away if you use no-split-gso on the cake qdiscs for the default traffic (noted below in the drr and hfsc cases with "!!! must use no-split-gso here !!!"). Only I’d like my 600 μs back. :)
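For concreteness, here is a minimal sketch of that workaround applied to the drr setup quoted below. It assumes the same eth0 device and handles, and only the default-traffic leaf (handle 10:) gets no-split-gso. The TC and DEV variables and the setup_qos function name are my own additions for illustration; by default the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: drr root with two cake leaves, as in the quoted drr example.
# TC defaults to "echo tc" (dry run, prints the commands); set TC=tc to apply as root.
TC="${TC:-echo tc}"
DEV="${DEV:-eth0}"

setup_qos() {
    $TC qdisc add dev "$DEV" root handle 1: drr
    $TC class add dev "$DEV" parent 1: classid 1:1
    $TC class add dev "$DEV" parent 1: classid 1:2
    # Default traffic: no-split-gso on this leaf is what avoids the lockup here.
    $TC qdisc add dev "$DEV" parent 1:1 handle 10: cake besteffort bandwidth 100mbit no-split-gso
    # VLAN 3300 traffic: unchanged from the quoted setup.
    $TC qdisc add dev "$DEV" parent 1:2 handle 11: cake besteffort bandwidth 100mbit ether-vlan
    $TC filter add dev "$DEV" protocol all parent 1:0 prio 1 basic match "meta(vlan mask 0xfff eq 0xce4)" flowid 1:2
    $TC filter add dev "$DEV" protocol all parent 1:0 prio 2 u32 match u32 0 0 flowid 1:1
}

setup_qos
```

Running it as-is prints the seven tc commands; running it with TC=tc as root applies them.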
This smells of a bug Toke fixed on Sep 12, 2018 in 42e87f12ea5c390bf5eeb658c942bc810046160a, but then reverted in the next commit because it was fixed upstream. However, if I re-apply that commit, it still doesn’t fix it. Perhaps there are more cases where skb_reset_mac_len(skb) needs to be called somewhere for VLAN support?

I managed to capture some output from what happens to hfsc:

[ 683.864456] ------------[ cut here ]------------
[ 683.869116] WARNING: CPU: 1 PID: 11 at net/sched/sch_hfsc.c:1427 0xf9ced4ef()
[ 683.876267] Modules linked in: cls_u32 em_meta cls_basic sch_cake(O) sch_drr xt_ACCOUNT(O) sch_hfsc cls_fw sch_sfq sch_prio ipt_Ra
[ 683.931317] CPU: 1 PID: 11 Comm: ksoftirqd/1 Tainted: G W O 3.16.7-ckt9-voyage #1
[ 683.939595] Hardware name: PC Engines APU/APU, BIOS 4.0 09/08/2014
[ 683.945790] 00000000 00000000 f5c8bc9c c13167e9 00000000 f5c8bcb4 c102a7dd f9ced4ef
[ 683.953792] f1907c00 00000000 00000000 f5c8bcc4 c102a803 00000009 00000000 f5c8bce4
[ 683.961791] f9ced4ef f1907fc8 732494ae 00000002 f1907c00 00000000 00000040 f5c8bd00
[ 683.969783] Call Trace:
[ 683.972256] [<c13167e9>] dump_stack+0x41/0x52
[ 683.976729] [<c102a7dd>] warn_slowpath_common+0x5c/0x73
[ 683.982063] [<f9ced4ef>] ? 0xf9ced4ee
[ 683.985834] [<c102a803>] warn_slowpath_null+0xf/0x13
[ 683.990905] [<f9ced4ef>] 0xf9ced4ee
[ 683.994499] [<c129edf2>] __qdisc_run+0x81/0xf0
[ 683.999052] [<c128b655>] __dev_queue_xmit+0x23d/0x35f
[ 684.004216] [<c128b78b>] dev_queue_xmit+0xa/0xc
[ 684.008857] [<f89fff29>] register_vlan_dev+0x938/0xe3b [8021q]
[ 684.014799] [<c128b33b>] dev_hard_start_xmit+0x29e/0x37b
[ 684.020223] [<c128b6c0>] __dev_queue_xmit+0x2a8/0x35f
[ 684.025381] [<c128b78b>] dev_queue_xmit+0xa/0xc
[ 684.030016] [<c12cf8d3>] arp_xmit+0x1c/0x47
[ 684.034307] [<c12cff27>] arp_send+0x2e/0x33
[ 684.038598] [<c12d01b4>] arp_process+0x288/0x4d8
[ 684.043331] [<c12ad986>] ? ip_forward_finish+0x66/0x6b
[ 684.048581] [<c128170e>] ? __kfree_skb+0x5d/0x5f
[ 684.053303] [<c12d04ce>] arp_rcv+0xca/0x102
[ 684.057597] [<c12895dd>] __netif_receive_skb_core+0x467/0x4b6
[ 684.063453] [<c1289674>] __netif_receive_skb+0x48/0x59
[ 684.068698] [<c1289cb9>] netif_receive_skb_internal+0x59/0x85
[ 684.074557] [<c128a2cc>] napi_gro_receive+0x31/0x6d
[ 684.079549] [<c10065ec>] ? text_poke_bp+0xa0/0xa0
[ 684.084369] [<f808604a>] 0xf8086049
[ 684.087974] [<c128a0b2>] net_rx_action+0x56/0x10e
[ 684.092791] [<c102d689>] __do_softirq+0x91/0x175
[ 684.097523] [<c102d783>] run_ksoftirqd+0x16/0x29
[ 684.102255] [<c1042734>] smpboot_thread_fn+0x108/0x11e
[ 684.107505] [<c104262c>] ? SyS_setgroups+0xa6/0xa6
[ 684.112403] [<c103de80>] kthread+0x9f/0xa4
[ 684.116615] [<c1319e01>] ret_from_kernel_thread+0x21/0x30
[ 684.122126] [<c103dde1>] ? kthread_freezable_should_stop+0x40/0x40
[ 684.128407] ---[ end trace cb7778967851e0ad ]---
[ 684.133646] ------------[ cut here ]------------
[ 684.138337] WARNING: CPU: 1 PID: 11 at net/sched/sch_hfsc.c:1427 0xf9ced4ef()
[ 684.145487] Modules linked in: cls_u32 em_meta cls_basic sch_cake(O) sch_drr xt_ACCOUNT(O) sch_hfsc cls_fw sch_sfq sch_prio ipt_Ra
[ 684.200459] CPU: 1 PID: 11 Comm: ksoftirqd/1 Tainted: G W O 3.16.7-ckt9-voyage #1
[ 684.208736] Hardware name: PC Engines APU/APU, BIOS 4.0 09/08/2014
[ 684.214933] 00000000 00000000 f5c8be98 c13167e9 00000000 f5c8beb0 c102a7dd f9ced4ef
[ 684.222930] f1907c00 00000000 00000000 f5c8bec0 c102a803 00000009 00000000 f5c8bee0
[ 684.230928] f9ced4ef f1907fc8 7364c482 00000002 f1907c00 00000000 00000040 f5c8befc
[ 684.238926] Call Trace:
[ 684.241399] [<c13167e9>] dump_stack+0x41/0x52
[ 684.245870] [<c102a7dd>] warn_slowpath_common+0x5c/0x73
[ 684.251206] [<f9ced4ef>] ? 0xf9ced4ee
[ 684.254979] [<c102a803>] warn_slowpath_null+0xf/0x13
[ 684.260055] [<f9ced4ef>] 0xf9ced4ee
[ 684.263651] [<c129edf2>] __qdisc_run+0x81/0xf0
[ 684.268203] [<c128744f>] net_tx_action+0x91/0xdd
[ 684.272927] [<c102d689>] __do_softirq+0x91/0x175
[ 684.277659] [<c102d783>] run_ksoftirqd+0x16/0x29
[ 684.282389] [<c1042734>] smpboot_thread_fn+0x108/0x11e
[ 684.287633] [<c104262c>] ? SyS_setgroups+0xa6/0xa6
[ 684.292529] [<c103de80>] kthread+0x9f/0xa4
[ 684.296735] [<c1319e01>] ret_from_kernel_thread+0x21/0x30
[ 684.302246] [<c103dde1>] ? kthread_freezable_should_stop+0x40/0x40
[ 684.308536] ---[ end trace cb7778967851e0ae ]---

> On Dec 28, 2018, at 1:58 PM, Pete Heist <[email protected]> wrote:
>
> Note that this doesn’t happen when prio is used in place of hfsc and cake is
> used in the leafs to do the rate limiting, i.e.:
>
> tc qdisc add dev eth0 root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> tc qdisc add dev eth0 parent 1:1 handle 10: cake besteffort bandwidth 100mbit ethernet # !!! must use no-split-gso here !!!
> tc qdisc add dev eth0 parent 1:2 handle 11: cake besteffort bandwidth 100mbit ethernet ether-vlan
> tc filter add dev eth0 protocol all parent 1:0 prio 1 basic match "meta(vlan mask 0xfff eq 0xce4)" flowid 1:2
> tc filter add dev eth0 protocol all parent 1:0 prio 2 u32 match u32 0 0 flowid 1:1
>
> But it does happen when drr is used instead of prio:
>
> tc qdisc add dev eth0 root handle 1: drr
> tc class add dev eth0 parent 1: classid 1:1
> tc class add dev eth0 parent 1: classid 1:2
> tc qdisc add dev eth0 parent 1:1 handle 10: cake besteffort bandwidth 100mbit
> tc qdisc add dev eth0 parent 1:2 handle 11: cake besteffort bandwidth 100mbit ether-vlan
> tc filter add dev eth0 protocol all parent 1:0 prio 1 basic match "meta(vlan mask 0xfff eq 0xce4)" flowid 1:2
> tc filter add dev eth0 protocol all parent 1:0 prio 2 u32 match u32 0 0 flowid 1:1
>
> drr might ultimately be what I want to use for this, so I can use cake to do
> the rate limiting instead of htb. prio works well but leads to starvation
> when the rate limit is above what the CPU can handle.
>
> Meanwhile, using htb classes with rate limits way above the actual, then rate
> limiting in the cake leafs, works as well, but this seems like a hack:
>
> tc qdisc add dev eth0 root handle 1: htb default 10
> tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit
> tc class add dev eth0 parent 1:1 classid 1:10 htb rate 5gbit
> tc class add dev eth0 parent 1:1 classid 1:11 htb rate 5gbit
> tc filter add dev eth0 protocol ip parent 1:0 prio 1 basic match "meta(vlan mask 0xfff eq 0xce4)" flowid 1:11
> tc qdisc add dev eth0 parent 1:10 handle 20: cake besteffort bandwidth 100mbit ethernet # !!! must use no-split-gso here !!!
> tc qdisc add dev eth0 parent 1:11 handle 21: cake besteffort bandwidth 100mbit ethernet ether-vlan
>
>> On Dec 28, 2018, at 12:30 AM, Pete Heist <[email protected]> wrote:
>>
>> I’m seeing what I think is an infinite loop when cake is used in a one-armed
>> router configuration with hfsc as the rate limiter. Three APUs are connected
>> to the same switch and the “middle” APU (apu1a) routes between the default
>> VLAN and a tagged VLAN.
>>
>> apu2a <-- default VLAN --> apu1a <-- VLAN 3300 --> apu2b
>>
>> After qos is set up, ping from apu2a to apu2b still works fine. When iperf3
>> is run from apu2a to apu2b it works fine, but when it goes in reverse (apu2b
>> to apu2a), all traffic stops flowing from apu1a on the default VLAN. Traffic
>> still flows from apu1a on VLAN 3300 however, with very high RTT (mean
>> 500ms), leading me to believe that the cake instance on the default VLAN is
>> in an infinite loop.
>>
>> It does not happen with hfsc+fq_codel, or with htb+cake in the same
>> configuration.
>>
>> Here are the commands that set up qos, and it only locks up when cake is
>> used as the instance at handle 20, not at handle 21:
>>
>> -----
>> tc qdisc add dev eth0 root handle 1: hfsc default 10
>> tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 200mbit ul rate 200mbit
>> tc class add dev eth0 parent 1:1 classid 1:10 hfsc sc rate 100mbit ul rate 100mbit
>> tc class add dev eth0 parent 1:1 classid 1:11 hfsc sc rate 100mbit ul rate 100mbit
>> tc filter add dev eth0 protocol ip parent 1:0 prio 1 \
>>   basic match "meta(vlan mask 0xfff eq 0xce4)" flowid 1:11
>> tc qdisc add dev eth0 parent 1:10 handle 20: fq_codel # using cake here locks up !!!
>> tc qdisc add dev eth0 parent 1:11 handle 21: cake
>> -----
>>
>> I’m using sch_cake and tc-adv from the current HEAD, on kernel 3.16.7 (yeah,
>> I know).
>>
>> root@apu1a:~/qos# uname -a
>> Linux apu1a 3.16.7-ckt9-voyage #1 SMP Thu Apr 23 11:10:44 HKT 2015 i686 GNU/Linux
>>
>> Any ideas just from this? Otherwise, I can only think to hook up the
>> serial cable and start with the printk’s…
>>
>
_______________________________________________
Cake mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/cake
