On 04/10/2019 00:10, Steven Rostedt wrote:
> On Wed, 2 Oct 2019 20:21:06 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
> 
>> On Wed, Oct 02, 2019 at 06:35:26PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>> ftrace was already batching the updates, for instance, causing 3 IPIs to 
>>> enable
>>> all functions. The text_poke() batching also works. But because of the 
>>> limited
>>> buffer [ see the reply to the patch 2/3 ], it is flushing the buffer during 
>>> the
>>> operation, causing more IPIs than the previous code. Using the 5.4-rc1 in a 
>>> VM,
>>> when enabling the function tracer, I see 250+ intermediate 
>>> text_poke_finish()
>>> because of a full buffer...
>>>
>>> Would this be the case of trying to use a dynamically allocated buffer?
>>>
>>> Thoughts?  
>>
>> Is it a problem? I tried growing the buffer (IIRC I made it 10 times
>> bigger) and didn't see any performance improvements because of it.
> 
> I'm just worried if people are going to complain about the IPI burst.
> Although, I just tried it out before applying this patch, and there's
> still a bit of a burst. Not sure why. I did:
> 
> # cat /proc/interrupts > /tmp/before; echo function > 
> /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after
> # cat /proc/interrupts > /tmp/before1; echo nop > 
> /debug/tracing/current_tracer; cat /proc/interrupts > /tmp/after1
> 
> Before this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:       2342       2347       2116       2175       2446       2030      
>  2416       2222   Function call interrupts
> ---
>> CAL:       2462       2467       2236       2295       2446       2150       
>> 2536       2342   Function call interrupts
> 
> (Just showing the function call interrupts)
> 
> There appears to be 120 IPIs sent to all CPUS for enabling function tracer.
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:       2462       2467       2236       2295       2446       2150      
>  2536       2342   Function call interrupts
> ---
>> CAL:       2577       2582       2351       2410       2446       2265       
>> 2651       2457   Function call interrupts
> 
> And 151 IPIs for disabling it.
> 
> After applying this patch:
> 
> # diff /tmp/before /tmp/after
> < CAL:      66070      46620      59955      59236      68707      63397      
> 61644      62742   Function call interrupts
> ---
>> CAL:      66727      47277      59955      59893      69364      64054      
>> 62301      63399   Function call interrupts
> 
> # diff /tmp/before1 /tmp/after1
> < CAL:      66727      47277      59955      59893      69364      64054      
> 62301      63399   Function call interrupts
> ---
>> CAL:      67358      47938      59985      60554      70025      64715      
>> 62962      64060   Function call interrupts
> 
> 
> We get 657 IPIs for enabling function tracer, and 661 for disabling it.
> Funny how it's more on the disable than the enable with the patch but
> the other way without it.
> 
> But still, we are going from 120 to 660 IPIs for every CPU. Not saying
> it's a problem, but something that we should note. Someone (those that
> don't like kernel interference) may complain.

That is the point I was raising.

When enabling ftrace, we have three different paths:

1) the enabling/disabling ftrace path
2) the int3 path - if a thread/irq is running a kernel function
3) the IPI - that affects all CPUs, even those that are not "hitting" trace
code, e.g., user-space.

The first one is for sure a cold-path. The second one is a hot-path: any task
running kernel functions will hit it. But IMHO, the hottest one is the IPIs,
because it will run on all CPUs, e.g., even isolated CPUs that are running in
user-space.

Currently, ftrace does:

        for_ftrace_rec:
                Install all breakpoints
        send IPI

        for_ftrace_rec:
                write the end of the instruction
        send IPI

        for_ftrace_rec:
                 write the first byte of the instruction
        send IPI

And that is the same thing we do with the batch mode, and so it would be better
to integrate both.

The problem is that considering the patch 2/3, the amount of entries we can
batch in the text_poke is limited, and so we batch on chunks of TP_VEC_MAX
entries. So, rather than doing 3 IPIs to change the code, we do:

(ftrace_rec count/TP_VEC_MAX) * 3 IPIs.

One possible solution for this would be to allocate a buffer with "number of
ftrace_rec" and use it in the text_poke batch mode.

But to do it, we should keep the old interface (the one that the 2/3 is
changing). Well, at least using a per-use-case buffer.

[ In addition ]

Currently, ftrace_rec entries are ordered inside the group of functions, but
"groups of function" are not ordered. So, the current int3 handler does a (*):

for_each_group_of_functions:
        check if the ip is in the range    ----> n by the number of groups.
                do a bsearch.              ----> log(n) by the numbers of entry
                                                 in the group.

If, instead, it uses an ordered vector, the complexity would be log(n) by the
total number of entries, which is better. So, how bad is the idea of:

        in the enabling ftrace code path, it:
                discover the number of entries
                alloc a buffer
                discover the order of the groups
                for each group in the correct order
                        queue the entry in the buffer
                apply the changes using the text_poke...

In this way we would optimize the two hot-paths:
        int3 will be log(n)
        IPIs bounded to 3.

I am not saying we need to do it now, as Steve said, not sure if this is a big
problem, but... those that don't like kernel interference may complain. But if
we leave the per-use-case vector in the text_poke_batch interface, things will
be easier to fix.

NOTE: the other IPIs are generated by hooking the tracepoints and switching the
code to RO/RW...
                
* as far as I understood ftrace_location_range().

-- Daniel

> -- Steve
> 

Reply via email to