On Thu 13 Jun 2019 at 11:11, Jiri Pirko <j...@resnulli.us> wrote:
> I made a mistake during measurements, sorry about that.
>
> This is the correct script:
> -----------------------------------------------------------------------
> #!/bin/bash
>
> dev=testdummy
> ip link add name $dev type dummy
> ip link set dev $dev up
> tc qdisc add dev $dev ingress
>
> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
> pref_id=1
>
> while [ $pref_id -lt 20000 ]
> do
>         echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>         #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>         ((pref_id++))
> done
>
> start=$(date +"%s")
> tc -b $tmp_file_name
> stop=$(date +"%s")
> echo "Insertion duration: $(($stop - $start)) sec"
> rm -f $tmp_file_name
>
> ip link del dev $dev
> -----------------------------------------------------------------------
>
> Note the commented out matchall. I don't see the regression with
> matchall. However, I see that with flower:
>
> kernel 5.1
> Insertion duration: 4 sec
> kernel 5.2
> Insertion duration: 163 sec
>
> I don't see any significant difference in perf:
>
> kernel 5.1
>     77.24%  tc               [kernel.vmlinux]    [k] tcf_chain_tp_find
>      1.67%  tc               [kernel.vmlinux]    [k] mutex_spin_on_owner
>      1.44%  tc               [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>      0.93%  tc               [kernel.vmlinux]    [k] idr_get_free
>      0.79%  tc_pref_scale_o  [kernel.vmlinux]    [k] do_syscall_64
>      0.69%  tc               [kernel.vmlinux]    [k] finish_task_switch
>      0.53%  tc               libc-2.28.so        [.] __memset_sse2_unaligned_erms
>      0.49%  tc               [kernel.vmlinux]    [k] __memset
>      0.36%  tc_pref_scale_o  libc-2.28.so        [.] malloc
>      0.30%  tc_pref_scale_o  libc-2.28.so        [.] _int_free
>      0.24%  tc               [kernel.vmlinux]    [k] __memcpy
>      0.23%  tc               [cls_flower]        [k] fl_change
>      0.23%  tc               [kernel.vmlinux]    [k] __nla_validate_parse
>      0.22%  tc               [kernel.vmlinux]    [k] __slab_alloc
>
> kernel 5.2
>     75.57%  tc               [kernel.kallsyms]   [k] tcf_chain_tp_find
>      2.70%  tc               [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
>      1.13%  tc_pref_scale_o  [kernel.kallsyms]   [k] do_syscall_64
>      0.87%  tc               libc-2.28.so        [.] __memset_sse2_unaligned_erms
>      0.86%  ip               [kernel.kallsyms]   [k] finish_task_switch
>      0.67%  tc               [kernel.kallsyms]   [k] memset
>      0.63%  tc               [kernel.kallsyms]   [k] mutex_spin_on_owner
>      0.52%  tc_pref_scale_o  libc-2.28.so        [.] malloc
>      0.48%  tc               [kernel.kallsyms]   [k] idr_get_free
>      0.46%  tc               [kernel.kallsyms]   [k] fl_change
>      0.42%  tc_pref_scale_o  libc-2.28.so        [.] _int_free
>      0.35%  tc_pref_scale_o  libc-2.28.so        [.] __GI___strlen_sse2
>      0.35%  tc_pref_scale_o  libc-2.28.so        [.] __mbrtowc
>      0.34%  tc_pref_scale_o  libc-2.28.so        [.] __fcntl64_nocancel_adjusted
>
> Any ideas?
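As an aside, the batch-file generation step of the quoted reproducer can be exercised on its own, with no root privileges or dummy device, to sanity-check the rule set being fed to "tc -b". This is a minimal sketch mirroring the script above (same device name and rule count); swapping kind=flower for kind=matchall yields the two variants being compared:

```shell
#!/bin/sh
# Generate the same tc batch file as the reproduction script above,
# without creating any netdev or qdisc. Set kind=matchall to build the
# non-regressing variant instead.
kind=flower
dev=testdummy
tmp_file_name=$(mktemp /tmp/tc_batch.XXXXXX)

pref_id=1
while [ $pref_id -lt 20000 ]; do
	echo "filter add dev $dev ingress proto ip pref $pref_id $kind action drop" >> "$tmp_file_name"
	pref_id=$((pref_id + 1))
done

# pref ids 1..19999 give 19999 rules, one per line
nrules=$(wc -l < "$tmp_file_name")
echo "$nrules rules generated"
rm -f "$tmp_file_name"
```

Feeding the resulting file to "tc -b" (as root, with the dummy device and ingress qdisc set up) then reproduces the timing difference.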
Thanks for providing the reproduction script! I've investigated the problem and found the root cause.

First of all, I noticed that CPU utilization during the problematic tc run is quite low (<10%), so I decided to investigate why tc sleeps so much. I used bcc and obtained the following off-CPU trace (uninteresting traces are omitted for brevity):

~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
...
    finish_task_switch
    __sched_text_start
    schedule
    schedule_timeout
    wait_for_completion
    __wait_rcu_gp
    synchronize_rcu
    fl_change
    tc_new_tfilter
    rtnetlink_rcv_msg
    netlink_rcv_skb
    netlink_unicast
    netlink_sendmsg
    sock_sendmsg
    ___sys_sendmsg
    __sys_sendmsg
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    -                tc (2069)
        142284953

As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The code is in the fl_create_new_mask() function:

	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
				      &newmask->ht_node, mask_ht_params);
	if (err)
		goto errout_destroy;

	/* Wait until any potential concurrent users of mask are finished */
	synchronize_rcu();

The justification for this is described in a comment in fl_check_assign_mask() (a user of fl_create_new_mask()):

	/* Insert mask as temporary node to prevent concurrent creation of mask
	 * with same key. Any concurrent lookups with same key will return
	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
	 * stack-allocated 'mask' to masks hash table because we call
	 * synchronize_rcu() before returning from this function (either in case
	 * of error or after replacing it with heap-allocated mask in
	 * fl_create_new_mask()).
	 */
	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
						       &mask->ht_node,
						       mask_ht_params);

The offending commit is part of my series that implements unlocked flower:

195c234d15c9 ("net: sched: flower: handle concurrent mask insertion")

The justification presented in it is no longer relevant since Ivan Vecera changed the mask to be dynamically allocated in commit 2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in fl_change()"). With this we can just change fl_change() to deallocate the temporary mask after an RCU grace period and remove the offending synchronize_rcu() call.

Any other suggestions?
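To illustrate the idea, here is a minimal kernel-style sketch of what that change could look like. This is not the actual patch; the rcu field name and the exact call sites are assumptions. Since commit 2cddd2014782 the temporary mask is heap-allocated, so it can be handed to the RCU machinery for deferred freeing instead of making the writer block:

```
/* Sketch only -- member name and call sites are assumptions. */

struct fl_flow_mask {
	/* ... existing members ... */
	struct rcu_head rcu;	/* assumed new member for deferred free */
};

/* fl_create_new_mask(): the blocking wait goes away entirely. */
-	/* Wait until any potential concurrent users of mask are finished */
-	synchronize_rcu();

/* fl_change(): free the temporary heap-allocated mask only after a
 * grace period, so concurrent readers that still see it in the masks
 * hash table cannot touch freed memory -- without the writer sleeping.
 */
-	kfree(mask);
+	kfree_rcu(mask, rcu);
```

The writer-side cost then drops from a full grace-period wait per insertion to a deferred callback, which should restore the batching throughput seen on 5.1.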