On Thu 13 Jun 2019 at 11:11, Jiri Pirko <j...@resnulli.us> wrote:
> I made a mistake during measurements, sorry about that.
>
> This is the correct script:
> -----------------------------------------------------------------------
> #!/bin/bash
>
> dev=testdummy
> ip link add name $dev type dummy
> ip link set dev $dev up
> tc qdisc add dev $dev ingress
>
> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
> pref_id=1
>
> while [ $pref_id -lt 20000 ]
> do
>         echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>         #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>         ((pref_id++))
> done
>
> start=$(date +"%s")
> tc -b $tmp_file_name
> stop=$(date +"%s")
> echo "Insertion duration: $(($stop - $start)) sec"
> rm -f $tmp_file_name
>
> ip link del dev $dev
> -----------------------------------------------------------------------
>
> Note the commented out matchall. I don't see the regression with
> matchall. However, I see that with flower:
>
> kernel 5.1
> Insertion duration: 4 sec
> kernel 5.2
> Insertion duration: 163 sec
>
> I don't see any significant difference in perf:
>
> kernel 5.1
>     77.24%  tc               [kernel.vmlinux]    [k] tcf_chain_tp_find
>      1.67%  tc               [kernel.vmlinux]    [k] mutex_spin_on_owner
>      1.44%  tc               [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>      0.93%  tc               [kernel.vmlinux]    [k] idr_get_free
>      0.79%  tc_pref_scale_o  [kernel.vmlinux]    [k] do_syscall_64
>      0.69%  tc               [kernel.vmlinux]    [k] finish_task_switch
>      0.53%  tc               libc-2.28.so        [.] __memset_sse2_unaligned_erms
>      0.49%  tc               [kernel.vmlinux]    [k] __memset
>      0.36%  tc_pref_scale_o  libc-2.28.so        [.] malloc
>      0.30%  tc_pref_scale_o  libc-2.28.so        [.] _int_free
>      0.24%  tc               [kernel.vmlinux]    [k] __memcpy
>      0.23%  tc               [cls_flower]        [k] fl_change
>      0.23%  tc               [kernel.vmlinux]    [k] __nla_validate_parse
>      0.22%  tc               [kernel.vmlinux]    [k] __slab_alloc
>
> kernel 5.2
>     75.57%  tc               [kernel.kallsyms]   [k] tcf_chain_tp_find
>      2.70%  tc               [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
>      1.13%  tc_pref_scale_o  [kernel.kallsyms]   [k] do_syscall_64
>      0.87%  tc               libc-2.28.so        [.] __memset_sse2_unaligned_erms
>      0.86%  ip               [kernel.kallsyms]   [k] finish_task_switch
>      0.67%  tc               [kernel.kallsyms]   [k] memset
>      0.63%  tc               [kernel.kallsyms]   [k] mutex_spin_on_owner
>      0.52%  tc_pref_scale_o  libc-2.28.so        [.] malloc
>      0.48%  tc               [kernel.kallsyms]   [k] idr_get_free
>      0.46%  tc               [kernel.kallsyms]   [k] fl_change
>      0.42%  tc_pref_scale_o  libc-2.28.so        [.] _int_free
>      0.35%  tc_pref_scale_o  libc-2.28.so        [.] __GI___strlen_sse2
>      0.35%  tc_pref_scale_o  libc-2.28.so        [.] __mbrtowc
>      0.34%  tc_pref_scale_o  libc-2.28.so        [.] __fcntl64_nocancel_adjusted
>
> Any ideas?
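As an aside, the batch-file generation step of the quoted reproducer can be exercised on its own, with no root privileges or dummy device, to sanity-check the rule set being fed to "tc -b". This is a minimal sketch mirroring the script above (same device name and rule count); swapping kind=flower for kind=matchall yields the two variants being compared:

```shell
#!/bin/sh
# Generate the same tc batch file as the reproduction script above,
# without creating any netdev or qdisc. Set kind=matchall to build the
# non-regressing variant instead.
kind=flower
dev=testdummy
tmp_file_name=$(mktemp /tmp/tc_batch.XXXXXX)

pref_id=1
while [ $pref_id -lt 20000 ]; do
	echo "filter add dev $dev ingress proto ip pref $pref_id $kind action drop" >> "$tmp_file_name"
	pref_id=$((pref_id + 1))
done

# pref ids 1..19999 give 19999 rules, one per line
nrules=$(wc -l < "$tmp_file_name")
echo "$nrules rules generated"
rm -f "$tmp_file_name"
```

Feeding the resulting file to "tc -b" (as root, with the dummy device and ingress qdisc set up) then reproduces the timing difference.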
Thanks for providing the reproduction script! I've investigated the problem and found the root cause.

First of all, I noticed that CPU utilization during the problematic tc run is quite low (<10%), so I decided to investigate why tc sleeps so much. I used bcc and obtained the following off-CPU trace (uninteresting traces are omitted for brevity):

~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
...
    finish_task_switch
    __sched_text_start
    schedule
    schedule_timeout
    wait_for_completion
    __wait_rcu_gp
    synchronize_rcu
    fl_change
    tc_new_tfilter
    rtnetlink_rcv_msg
    netlink_rcv_skb
    netlink_unicast
    netlink_sendmsg
    sock_sendmsg
    ___sys_sendmsg
    __sys_sendmsg
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    -                tc (2069)
        142284953

As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The code is in the fl_create_new_mask() function:

	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
				      &newmask->ht_node, mask_ht_params);
	if (err)
		goto errout_destroy;

	/* Wait until any potential concurrent users of mask are finished */
	synchronize_rcu();

The justification for this is described in a comment in fl_check_assign_mask() (a user of fl_create_new_mask()):

	/* Insert mask as temporary node to prevent concurrent creation of mask
	 * with same key. Any concurrent lookups with same key will return
	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
	 * stack-allocated 'mask' to masks hash table because we call
	 * synchronize_rcu() before returning from this function (either in case
	 * of error or after replacing it with heap-allocated mask in
	 * fl_create_new_mask()).
	 */
	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
						       &mask->ht_node,
						       mask_ht_params);

The offending commit is part of my series that implements unlocked flower:

195c234d15c9 ("net: sched: flower: handle concurrent mask insertion")

The justification presented in it is no longer relevant since Ivan Vecera changed the mask to be dynamically allocated in commit 2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in fl_change()"). With this we can just change fl_change() to deallocate the temporary mask after an RCU grace period and remove the offending synchronize_rcu() call.

Any other suggestions?
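To illustrate the idea, here is a minimal kernel-style sketch of what that change could look like. This is not the actual patch; the rcu field name and the exact call sites are assumptions. Since commit 2cddd2014782 the temporary mask is heap-allocated, so it can be handed to the RCU machinery for deferred freeing instead of making the writer block:

```
/* Sketch only -- member name and call sites are assumptions. */

struct fl_flow_mask {
	/* ... existing members ... */
	struct rcu_head rcu;	/* assumed new member for deferred free */
};

/* fl_create_new_mask(): the blocking wait goes away entirely. */
-	/* Wait until any potential concurrent users of mask are finished */
-	synchronize_rcu();

/* fl_change(): free the temporary heap-allocated mask only after a
 * grace period, so concurrent readers that still see it in the masks
 * hash table cannot touch freed memory -- without the writer sleeping.
 */
-	kfree(mask);
+	kfree_rcu(mask, rcu);
```

The writer-side cost then drops from a full grace-period wait per insertion to a deferred callback, which should restore the batching throughput seen on 5.1.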