On Sun, 2026-03-08 at 13:55 +0800, sun jian wrote:
> On Sat, Mar 7, 2026 at 12:23 AM Alexei Starovoitov
> <[email protected]> wrote:
> > 
> > On Fri, Mar 6, 2026 at 8:15 AM Paul Chaignon <[email protected]> 
> > wrote:
> > Sun Jian,
> > I asked to do a _minimal_ tweak to pyperf600.
> > What you did is a drastic change. Pls don't hack tests
> > just to make them pass. The tests have to be meaningful
> > and test coverage shouldn't degrade.
> > 
> 
> Hi Alexei, Paul,
> 
> I spent some more time looking into this.
> 
> Comparing unmodified pyperf600 bytecode between clang-18 and clang-20, I
> see fewer instructions with clang-20 and nearly the same number of
> branches:
> 
> clang-18: 90134 lines of disassembly, 6090 gotos
> clang-20: 78369 lines of disassembly, 6085 gotos
> 
> So this does not look like a simple program-size increase. What changes
> is the branch layout in the unrolled loop body, which appears to make
> the verifier's DFS go deeper before pruning.
> 
> One useful data point is that a single __on_event() copy does load
> successfully (that was my v2), while with 2 or more copies it
> consistently fails at exactly 8193 jumps. In other words, the verifier
> hits the jump-sequence limit before reaching the second copy.
> 
> I also tried a range of source-level mitigations, but so far I couldn't
> find one that preserves the test intent and keeps pyperf600 comparable
> to the other variants:
> 
> - UNROLL_COUNT tuning: 99 does not compile; 100-120 compile but still
> fail at 8193; 121-145 fail to compile; 146-150 compile but still fail
> at 8193
> - early break/goto on !frame_ptr: insufficient for pyperf600, and also
> hurts pyperf600_nounroll by adding branch points to the 600-iteration loop
> - wrapping 5x __on_event() in a non-unrolled loop: verifier still unrolls it
> - making get_frame_data() __noinline: still fails
> - moving the unwind loop into a __noinline subprog: still fails
> - SUBPROGS / __on_event as __noinline: still fails; codegen changes,
> but the verifier still hits 8193
> 
> Paul also mentioned trying STACK_MAX_LEN/UNROLL_COUNT and only getting it
> to work with STACK_MAX_LEN reduced to 180, which would make it too close
> to pyperf180.
> 
> The only source change I found that passes is reducing __on_event() to a
> single copy, but that clearly weakens the test as pointed out.
> 
> At this point, I don't have a source-level fix that preserves the test
> intent.

Hi Sun,

I have an old investigation of the pyperf600 failure from March 2024;
I'm attaching it to this email. The discussion happened off-list.
The source-level "mitigation" I found back then still stands:

  --- a/tools/testing/selftests/bpf/progs/pyperf.h
  +++ b/tools/testing/selftests/bpf/progs/pyperf.h
  @@ -97,8 +97,15 @@ static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData,
                              frame_ptr + pidData->offsets.PyFrameObject_code);
   
          // read data from PyCodeObject
  +#if __BPF_CPU_VERSION__ < 4
          if (!frame->f_code)
                  return false;
  +#else
  +        asm volatile goto("if %[f_code] != 0 goto %l[has_f_code];"
  +                             :: [f_code]"r"(frame->f_code) :: has_f_code);
  +        return false;
  +has_f_code:
  +#endif

(cpuv4 is needed because jump instructions whose offsets exceed the
 16-bit range are only possible with cpuv4.)

The decision back then was that the "mitigation" is too brittle to
apply and we should leave the test as-is, hoping that the verifier
would get smarter some day and be able to load the program.

Best regards,
Eduard
# What happened

The pyperf600 test fails to verify when compiled by recent clang revisions.
The last known good revision is [0], the first known bad revision is [1].
Revision [1] comes from the pull request [2].

Verifier error when using revision [1]:

    ...
    ; if (frame->co_name) @ pyperf.h:118
    25460: (79) r3 = *(u64 *)(r10 -32)    ; R3_w=scalar() R10=fp0 fp-32=mmmmmmmm
    25461: (15) if r3 == 0x0 goto pc+7
    The sequence of 8193 jumps is too complex.
    verification time 822174 usec
    stack depth 360

All testing below was done using revisions [0] and [1].

# pyperf600 structure

The relevant parts of the test look as follows:

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
        if (frame->co_filename) { ... }
        if (frame->co_name) { ... }
        return true;
    }

    int __on_event(...)
    {
        ...
        #pragma clang loop unroll_count(UNROLL_COUNT) // UNROLL_COUNT == 150
        for (int i = 0; i < STACK_MAX_LEN; ++i)      // STACK_MAX_LEN == 600
            if (frame_ptr && get_frame_data(...)) {
                if (!symbol_id) { ... }
                if (*symbol_id == new_symbol_id) { ... }
                ...
            }
        ...
    }

    SEC("raw_tracepoint/kfree_skb")
    int on_event(struct bpf_raw_tracepoint_args* ctx)
    {
        ...
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        __on_event(...);
        ...
    }

The call to get_frame_data() is inlined.
The main takeaways are:
- the BPF program consists of five calls to __on_event();
- __on_event() has a big loop inside;
- the loop body has 5 conditionals
  (counting the conditionals inlined from get_frame_data()).

# LLVM change description

The relevant part of [1] is:

    --- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    +++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
    @@ -1282,7 +1295,7 @@ tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI, ScalarEvolution &SE,
       }

       // Do not attempt partial/runtime unrolling in FullLoopUnrolling
    -  if (OnlyFullUnroll && !(UP.Count >= MaxTripCount)) {
    +  if (OnlyFullUnroll && (UP.Count < TripCount || UP.Count < MaxTripCount)) {
         LLVM_DEBUG(
             dbgs() << "Not attempting partial/runtime unroll in FullLoopUnroll.\n");
         return LoopUnrollResult::Unmodified;

- `UP.Count` is the preferred number of iterations to unroll;
  it is 150 for pyperf600;
- `TripCount` is the predicted number of loop iterations;
  it is 600 for pyperf600.

The hunk above does exactly what the comment says:
it prevents the full unrolling pass from partially unrolling
the main pyperf600 loop.

There is also a partial unrolling pass done later in the pipeline.

# LLVM change impact on pyperf600

Prior to [1] the loop in pyperf600 was unrolled by the full unrolling
pass; after [1] it is unrolled by the partial unrolling pass.
This change causes a subtle rearrangement of basic blocks inside the
loop, which turns out to be important for the verifier.

The rearrangement occurs inside the inlined body of get_frame_data():

    static __always_inline bool get_frame_data(...)
    {
        ...
        if (!frame->f_code)
            return false;
        ...
    }

Translation before [1]:                 Translation after [1]:

; if (!frame->f_code)                   ; if (!frame->f_code)
  r3 = *(u64 *)(r10 - 0x30)               r3 = *(u64 *)(r10 - 0x30)
  if r3 != 0x0 goto +0x2 <LBB0_19>        if r3 == 0x0 goto +0x4b <LBB0_39>

Before [1] the fall-through path is to `return false`,
after [1] the fall-through path is to the rest of get_frame_data() body.

The `if (!frame->f_code)` is the first conditional in the loop body
and it guards all other conditionals in the body
(when !frame->f_code is true, the rest of the conditionals are skipped).

Before [1] the verifier would process pyperf600 in the following order:

- __on_event()
  - process loop 600 times:
    - `if (!frame->f_code) return false`:
      - fall-through is to `return false`;
      - push one jump to the jump history;
      - assume the fall-through branch and skip the rest of the loop body;
- __on_event(): same thing, push 600 jumps to jump history;
- __on_event(): same thing, push 600 jumps to jump history;
- __on_event(): same thing, push 600 jumps to jump history;
- __on_event():
  this is the last call to __on_event,
  all branches within it are verified before proceeding
  with branches pushed for previous calls.

When the loop inside the last call to __on_event() is verified, a
checkpoint at its start becomes viable.
Branches pushed while the previous calls to __on_event() were processed
eventually hit this checkpoint, and the whole process converges.
Thus, at its peak, the jump history length would be ~600*5 == 3000.

However, after [1] the fall-through path for the `if (!frame->f_code)`
leads to the other conditionals in the loop body,
pushing up to 5 conditionals to the jump history for each iteration.
Hence, the peak jump history length would be something like
~600*5*5 == 15000, which is outside the verifier's current limit.

# Possible fix #1: change pyperf600 basic blocks layout

The diff below is sufficient to make the test verify again after [1]
(tested with cpuv4; cpuv3 generates jumps that overflow the 16-bit offset range):

    --- a/tools/testing/selftests/bpf/progs/pyperf.h
    +++ b/tools/testing/selftests/bpf/progs/pyperf.h
    @@ -97,8 +97,10 @@ static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData,
                                frame_ptr + pidData->offsets.PyFrameObject_code);

            // read data from PyCodeObject
    -       if (!frame->f_code)
    -               return false;
    +       asm volatile goto("if %[f_code] != 0 goto %l[has_f_code]"
    +                         :: [f_code]"r"(frame->f_code) :: has_f_code);
    +       return false;
    +has_f_code:
            bpf_probe_read_user(&frame->co_filename,
                                sizeof(frame->co_filename),
                                frame->f_code + pidData->offsets.PyCodeObject_filename);

Effectively, this forces the verifier to first explore the `return false`
branch of the first conditional in the loop body,
the same way it was done before [1].

(The likely/unlikely macros relying on __builtin_expect() don't give
 the desired code layout for some reason.)

# Possible fix #2: change pyperf600 limits

The diff below reduces the loop size enough to fit inside the jump
history limit (again, works for cpuv4, but not for cpuv3):

    --- a/tools/testing/selftests/bpf/progs/pyperf600.c
    +++ b/tools/testing/selftests/bpf/progs/pyperf600.c
    @@ -1,6 +1,6 @@
     // SPDX-License-Identifier: GPL-2.0
     // Copyright (c) 2019 Facebook
    -#define STACK_MAX_LEN 600
    +#define STACK_MAX_LEN 230
     /* Full unroll of 600 iterations will have total
      * program size close to 298k insns and this may
      * cause BPF_JMP insn out of 16-bit integer range.

# Possible fix #3: verifier changes

Another option would be to change the verifier's current
branch-exploration rule:
when inside a loop, don't always explore the fall-through branch first;
instead, predict which branch would push fewer conditionals
onto the jump history and explore that one first.

I need more time to assess whether this is a feasible option in terms
of added complexity.

# Links

[0] Last good revision:
    c3291253c3b5 ("Revert "[scudo] [MTE] resize stack depot for allocation ring buffer" (#80777)")
[1] First broken revision:
    99ddd77ed9e1 ("[LoopUnroll] Introduce PragmaUnrollFullMaxIterations as a hard cap on how many iterations we try to unroll (#78648)")
[2] https://github.com/llvm/llvm-project/pull/78648
