Re: [RFC] GCC support for live-patching

2018-10-23 Thread Nicolai Stange
Hi Qing,

Qing Zhao  writes:

> Thanks a lot for your detailed explanation of the source-based live patch
> creation procedure.
> Really interesting and helpful information.
>
> more questions and comments below:
>
>>> 
>>> One question here: what’s the major benefit of preparing the patches
>>> manually?
>> 
>> There is none. We here at SUSE prefer the source based approach (as
>> opposed to binary diff) for a number of reasons and the manual live
>> patch creation is simply a consequence of not having any tooling for
>> this yet.
>> 
>> 
>> For reference, source based live patch creation involves the following
>> steps:
>> 
>> 1. Determine the initial set of to-be-patched functions:
>>   a.) Inspect the upstream diff for the fix in question, add
>>   any touched functions to the initial set.
>>   b.) For each function in the initial set, check whether it has been
>>   inlined/cloned/optimized and if so, add all its callers to the
>>   initial set.  Repeat until the initial set has stabilized.
>> 
>> 2. Copy & paste the initial set over to the new live patch sources.
>> 
>> 3. Make it compile, i.e. recursively copy any needed cpp macro, type, or
>>   function definition and add references to data objects with static
>>   storage duration.
>>   The rules are:
>>   a.) For data objects with static storage duration, a reference to the
>>   original must always be made. (If the symbol is EXPORT()ed, then
>>   fine. Otherwise, for kGraft, this involves a kallsyms lookup at
>>   patch module load time; for upstream kernel live patching, this
>>   has been solved with those '.klp' relocations).
>>   b.) If a called function is available as a symbol from either vmlinux
>>   or some (usually the patched) module, do not copy the definition,
>>   but add a reference to it, just as in a.).
>>   c.) If a type, cpp macro or (usually inlined) function is provided by
>>   some "public" header in /include/, include that
>>   rather than copying the definition.  Counterexample: a non-public
>>   header outside of include/, e.g. /fs/btrfs/qgroup.h.
>>   d.) Otherwise copy the definition to the live patch module sources.
>> 
>> Rule 3b is not strictly necessary, but it helps in reducing the live
>> patch code size which is a factor with _manual_ live patch creation.
>> 
>> For 1b.), we need help from GCC. Namely, we want to know when some
>> function has been optimized and we want it to disable any of those IPA
>> optimizations it (currently) isn't capable of reporting properly.
>
> Yes, this is the place where GCC can help, and it’s also the motivation
> for the current proposal.
>
>> 
>> Step 3.) is a bit tedious sometimes TBH and yes, w/o any tooling in
>> place, patch size would be a valid point. However, I'm currently working
>> on that and I'm optimistic that I'll have a working prototype soon.
>
> So, currently, step 3 is done manually?

Currently yes.


> If the initial set of patched functions is too big, this work is
> really tedious and error-prone.

For the "tedious" part: yes it sometimes is. But we haven't seen any
problems or bugs so far. So it's reliable in practice.


>
> Without a good tool for step 3, controlling the size of the initial set of
> patched functions is still meaningful.
>
>> 
>> That tool would be given the GCC command line from the original or "live
>> patch target" kernel compilation for the source file in question, the
>> set of functions as determined in 1.) and a number of user provided
>> filter scripts to make the decisions in 3.). As a result, it would
>> output a self-contained, minimal subset of the original kernel sources.
>> 
>> With that tooling in place, live patch code size would not be a real
>> concern for us.
>
> that’s good to know.
>
> however, my question here is:
>
> Can this tool be easily adopted by applications other than the Linux kernel?
> I.e., if there is another application that tries to use GCC’s live patching
> feature with a manually created source patch, will your tool for step 3 be
> readily usable by this application?  Or will this application have to develop
> a similar but different tool for itself?

This tool's scope is the C99 language, or more specifically the GCC dialect
thereof, and I'll try to keep the CLI agnostic to the application or build
environment anyway.

That said, my primary interest is the Linux kernel and I'm going to
make sure that it works there. It might not work out of the box for
random applications, but might require some tweaking or even bug fixing.



>> 
>> So in conclusion, what we need from GCC is the information on when we
>> have to live patch callers due to optimizations. If that's not possible
>> for a particular class of optimization, it needs to be disabled.
>> 
>> OTOH, we definitely want to keep the set of these disabled optimizations
>> as small as possible in order to limit the impact of live patching on
>> kernel performance. In particular, disabling any of the "cloning"
>> optimizations, which GCC is able to report properly, would be a
>> no-no IMO.

Re: [RFC] GCC support for live-patching

2018-10-23 Thread Nicolai Stange
Hi,

Qing Zhao  writes:

>> 
>> thanks for the proposal. The others have already expressed some of my 
>> worries and remarks, but I think it would be only right to write them 
>> again. Especially since I am part of the team responsible for 
>> implementation and maintenance of live patches here at SUSE, we use kGraft 
>> and we prepare everything manually (compared to kpatch and ksplice).
>
> One question here: what’s the major benefit of preparing the patches manually?

There is none. We here at SUSE prefer the source based approach (as
opposed to binary diff) for a number of reasons and the manual live
patch creation is simply a consequence of not having any tooling for
this yet.


For reference, source based live patch creation involves the following
steps:

1. Determine the initial set of to-be-patched functions:
   a.) Inspect the upstream diff for the fix in question, add
   any touched functions to the initial set.
   b.) For each function in the initial set, check whether it has been
   inlined/cloned/optimized and if so, add all its callers to the
   initial set.  Repeat until the initial set has stabilized.

2. Copy & paste the initial set over to the new live patch sources.

3. Make it compile, i.e. recursively copy any needed cpp macro, type, or
   function definition and add references to data objects with static
   storage duration.
   The rules are:
   a.) For data objects with static storage duration, a reference to the
   original must always be made. (If the symbol is EXPORT()ed, then
   fine. Otherwise, for kGraft, this involves a kallsyms lookup at
   patch module load time; for upstream kernel live patching, this
   has been solved with those '.klp' relocations). See the sketch
   after this list for an illustration.
   b.) If a called function is available as a symbol from either vmlinux
   or some (usually the patched) module, do not copy the definition,
   but add a reference to it, just as in a.).
   c.) If a type, cpp macro or (usually inlined) function is provided by
   some "public" header in /include/, include that
   rather than copying the definition.  Counterexample: a non-public
   header outside of include/, e.g. /fs/btrfs/qgroup.h.
   d.) Otherwise copy the definition to the live patch module sources.
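
As an illustration of rule a.), here is a minimal, purely hypothetical
sketch of what the kGraft-style indirection to a non-exported data
object could look like in a manually written live patch source. The
symbol name "foo_stats" and the lp_*() function names are made up;
kallsyms_lookup_name() is the kernel API used for the lookup (still
exported to modules in the kernel versions in question):

#include <linux/kallsyms.h>
#include <linux/errno.h>

/* Pointer to the original, non-exported static data object. */
static unsigned long *lp_foo_stats;

/* Resolve the reference at patch module load time (rule a.);
 * called from the patch module's init function. */
static int lp_resolve_symbols(void)
{
        lp_foo_stats = (unsigned long *)kallsyms_lookup_name("foo_stats");
        if (!lp_foo_stats)
                return -ENOENT;
        return 0;
}

/* The copied and fixed function, rewritten to go through the
 * indirection instead of referencing the original static object
 * directly; it would be wired into the patch's function table. */
static void lp_foo_account(unsigned long n)
{
        *lp_foo_stats += n;
}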

Rule 3b is not strictly necessary, but it helps in reducing the live
patch code size which is a factor with _manual_ live patch creation.

For 1b.), we need help from GCC. Namely, we want to know when some
function has been optimized and we want it to disable any of those IPA
optimizations it (currently) isn't capable of reporting properly.
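
To illustrate 1b.) with a made-up example (the function names are
hypothetical, not from any actual kernel source): suppose the upstream
fix touches bar() below. Because "flag" is constant at the call site,
GCC may have inlined bar() into foo() or emitted a specialized IPA-CP
clone such as bar.constprop.0 for it. Patching bar() alone would then
leave the old code running inside foo(), so foo() has to be added to
the initial set as well, and the check repeated for foo()'s callers:

static int bar(int x, int flag)
{
        /* Assume the upstream fix changes this function. */
        if (flag)
                return x * 2;
        return x;
}

int foo(int x)
{
        /* "flag" is known to be 1 here, a candidate for inlining or
         * IPA-CP cloning of bar(). */
        return bar(x, 1);
}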

Step 3.) is a bit tedious sometimes TBH and yes, w/o any tooling in
place, patch size would be a valid point. However, I'm currently working
on that and I'm optimistic that I'll have a working prototype soon.

That tool would be given the GCC command line from the original or "live
patch target" kernel compilation for the source file in question, the
set of functions as determined in 1.) and a number of user provided
filter scripts to make the decisions in 3.). As a result, it would
output a self-contained, minimal subset of the original kernel sources.

With that tooling in place, live patch code size would not be a real
concern for us.

So in conclusion, what we need from GCC is the information on when we
have to live patch callers due to optimizations. If that's not possible
for a particular class of optimization, it needs to be disabled.

OTOH, we definitely want to keep the set of these disabled optimizations
as small as possible in order to limit the impact of live patching on
kernel performance. In particular, disabling any of the "cloning"
optimizations, which GCC is able to report properly, would be a
no-no IMO.

IIUC, our preferred selection of allowed IPA optimizations would be
provided by what you are referring to as "-flive-patching=inline-clone".



>> 
>>> 1. A study of Kernel live patching schemes.
>>> 
>>> Three major kernel live patching tools:  https://lwn.net/Articles/734765/
>>> 
>>> * ksplice:   http://www.ksplice.com/doc/ksplice.pdf
>>> * kpatch:    https://lwn.net/Articles/597123/
>>>    https://github.com/dynup/kpatch
>>> * kGraft:
>>> https://pdfs.semanticscholar.org/presentation/af4c/895aa3fef0cc2b501317aaec9d91ba2d704c.pdf
>>> 
>>> In the above, ksplice and kpatch can automatically generate binary patches
>>> as follows:
>>> 
>>>   * a collection of tools which convert a source diff patch to a patch
>>> module. They work by compiling the kernel both with and without the source
>>> patch, comparing the binaries, and generating a binary patch module which 
>>> includes new binary versions of the functions to be replaced.
>>> 
>>> On the other hand, kGraft offers a way to create patches entirely by hand.
>>> The source of the patch is a single C file, easy to review, easy to
>>> maintain. 
>>> 
>>> In addition to kGraft, there are other live patching tools that prefer
>>> creating patches 

Performance impact of disabling non-clone IPA optimizations for the Linux kernel (was: "GCC options for kernel live-patching")

2018-10-23 Thread Nicolai Stange
Hi,

let me summarize some results from performance comparisons of Linux
kernels compiled with and without certain IPA optimizations.

It's a slight abuse of this thread, but I think having the numbers might
give some useful insight into the potential costs associated with
the -flive-patching option discussed here.

All kudos go to Giovanni Gherdovich from the SUSE Performance Team who
did all of the work presented below.

For a TL;DR, see the conclusion at the end of this email.

Martin Jambor  writes:

> (this message is a part of the thread originating with
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01018.html)
>
> We have just had a quick discussion with two upstream maintainers of
> Linux kernel live-patching about this and the key points were:
>
> 1. SUSE live-patch creators (and I assume all that use the upstream
>    live-patching method) use Martin Liska's (somewhat under-documented)
>    -fdump-ipa-clones option and a utility he wrote
>    (https://github.com/marxin/kgraft-analysis-tool) to deal with all
>    kinds of inlining, IPA-CP and generally all IPA optimizations that
>    internally create a clone.  The tool tells them what happened and
>    also lists all callers that need to be live-patched.
>
> 2. However, there is growing concern about other IPA analyses that do
>    not create a clone but still affect code generation in other
>    functions.  Kernel developers have identified and disabled IPA-RA but
>    there are more of them, such as IPA-modref analysis, stack alignment
>    propagation and possibly quite a few others which extract information
>    from one function and use it in a caller or perhaps even in some
>    almost-unrelated functions (such as detection of read-only and
>    write-only static global variables).
>
>    The kernel live-patching community would welcome it if GCC had an option
>    that could disable all such optimizations/analyses for which it
>    cannot provide a list of all affected functions (i.e. which ones need
>    to be live-patched if a particular function is).

AFAIU, the currently known IPA optimizations of this category are
(cf. [1] and [2] from this thread):

 - -fipa-pure-const
 - -fipa-pta
 - -fipa-reference
 - -fipa-ra
 - -fipa-icf
 - -fipa-bit-cp
 - -fipa-vrp
 - and some others which might be problematic but currently can't get
   disabled on the cli:
   - stack alignment requirements
   - duplication of or skipping of alias analysis for
     functions/variables whose address is not taken (I don't know what
     that means, TBH).
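
To illustrate the problem with a made-up sketch (names are
hypothetical): with -fipa-pure-const, GCC may deduce that a callee
never writes to memory and then let a caller reuse an earlier load of
a global across the call. If a live patch changes the callee so that
it does write that global, the unpatched caller would silently keep
using the stale value, and GCC currently gives us no list of such
callers:

extern unsigned int threshold;

/* -fipa-pure-const may deduce this function to be "pure": it only
 * reads memory. */
unsigned int scaled_threshold(unsigned int scale)
{
        return threshold * scale;
}

unsigned int clamp(unsigned int v)
{
        unsigned int lo = threshold;            /* first load of "threshold" */
        unsigned int hi = scaled_threshold(4);  /* deduced not to write memory */
        unsigned int cur = threshold;           /* GCC may reuse the first load
                                                 * here because the call in
                                                 * between is known not to
                                                 * modify memory */

        if (v < lo)
                return lo;
        if (v > hi)
                return hi;
        return cur;
}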

Some time ago, Giovanni compared the performance of a kernel compiled with

 -fno-ipa-pure-const
 -fno-ipa-pta
 -fno-ipa-reference
 -fno-ipa-ra
 -fno-ipa-icf
 -fno-ipa-bit-cp
 -fno-ipa-vrp

plus (because I wasn't able to tell whether these are problematic in the
context of live patching)

 -fno-ipa-cp
 -fno-ipa-cp-clone
 -fno-ipa-profile
 -fno-ipa-sra

against a kernel compiled without any of these.

The kernel was a 4.12.14 one with additional patches on top.

The benchmarks were performed on both a smaller and a bigger machine.
Specs:
- single socket with a Xeon E3-1240 v5 (Skylake), 4 cores / 8 threads,
  32G of memory (UMA)
- 2 sockets with each one mounting a Xeon E5-2698 v4 (Broadwell) for a
  total of 40 cores / 80 threads and 528G of memory (NUMA)

You can find the results here:

  https://beta.suse.com/private/nstange/ggherdovich-no-ipa-results/dashboard.html

"laurel2" is the smaller machine, "hardy4" the bigger one.

The numbers presented in the dashboard are a relative measure of how the
no-ipa kernel was performing in comparison to the stock one. "1" means
no change, and, roughly speaking, each deviation by 0.01 from that value
corresponds to an overall performance change of 1%. Depending on the
benchmark, higher means better (e.g. for throughput) or vice versa
(e.g. for latencies); 0.96 on a throughput benchmark, for example, means
the no-ipa kernel was roughly 4% slower. Some of the numbers are
highlighted in green or red. Green means that the no-ipa kernel performs
better, red the contrary.

The sockperf-{tcp,udp}-under-load results are spoiled due to outliers,
probably because of slow vs. fast paths. Please ignore.

(If you're interested in the detailed results, you can click on any of
 those accumulated numbers in the dashboard. Scroll down and you'll find
 some nice plots.)

For the overall outcome, let me quote Giovanni who summarized it nicely:

  What's left in red:

  * fsmark-threaded on laurel2 (skylake 8 cores), down 2%: if you look at the
    histograms of files created per second, there is never a clear winner
    between with and without IPA (except for the single-threaded case). Clean
    on hardy4.

  * sockperf-udp-throughput, hardy4: yep this one is statistically
    significant (in the plot you clearly see that the green dots are all
    below the yellow dots). 4% worse on average. Clean on the other machine.

  * tbench: this one is significant too (look at the histogram, no
    overlapping between the two distributions) but it's a curious one,
    because on the