ankurrj7 wrote: I did a detailed local pass over this WIP PR and collected validation/benchmark data. Posting the notes here so the design tradeoff is explicit before asking for formal review.
### What this PR changes
- Adds an opt-in driver/cc1 flag: `-fcoverage-call-continuations`.
- Requires existing Clang source coverage mode: `-fprofile-instr-generate
-fcoverage-mapping`.
- Adds a continuation counter after returning calls. The counter is only
incremented if control returns normally to the continuation point.
- Coverage mapping uses that continuation counter for source after a call, so
source after `exit`, `longjmp`, or a non-returning call chain is not reported
as covered just because the call itself was reached.
- Default source coverage behavior is unchanged unless the new flag is passed.
### Validation run
- `git diff --check origin/main...HEAD`: clean.
- `git-clang-format --diff origin/main HEAD ...`: clean when run with the local
LLVM `clang-format` binary.
- Focused tests: 3/3 passed.
- `clang/test/CoverageMapping/call-continuations.c`
- `clang/test/CoverageMapping/call-continuations-tight.c`
- `clang/test/Driver/coverage-call-continuations.c`
- Broader coverage mapping tests: `clang/test/CoverageMapping`, 77/77 passed.
- Driver checks verified the flag is forwarded with coverage mapping and
rejected without `-fcoverage-mapping`.
### Behavior spot checks
| Case | Result with `-fcoverage-call-continuations` |
|---|---|
| Same-file `exit(0)` sink | Line after `terminate_same_file()` is `0`; line
after caller `fun1_same_file()` is `0`. |
| Cross-file `exit(0)` sink | Line after external `terminate_other_file()` is
`0`; line after caller `fun1_cross_file()` is `0`. |
| `setjmp`/`longjmp` | `if (sj == 0)` has count `2`, landing path after
`longjmp` is covered, and statements after the non-returning calls are `0`. |
The small behavior repro sources and reports are saved locally under:
```text
/scratch/ankurraj/clang_experiment_23/docs/pr201079-review/cases
/scratch/ankurraj/clang_experiment_23/temp/pr201079-review-cases
```
### Counter growth measured on the benchmark source
| Mode | Instrumented functions | Counter slots |
|---|---:|---:|
| Default Clang source coverage | 131 | 136 |
| `-fcoverage-call-continuations` | 131 | 272 |
For the small `setjmp/longjmp` repro, the counter slots changed from 4 to 11.
This is the expected cost of the mode: it buys precise post-call coverage by
adding counters after returning calls. This is why I think keeping it opt-in is
the right direction.
### GCC/ICC comparison benchmark
Benchmark machine/toolchain notes:
- Clang: `clang version 23.0.0git (2be52f1b4c7dd7907b08f554c0160550727195cb)`
- GCC: `gcc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)`
- ICC classic: `icc (ICC) 2021.10.1 20231115`
- Runtime input: `1200000` iterations, 3 runtime samples.
- Compile timing: 5 compile/link samples per case.
- Note: the optimizer/SLP fast-path investigation is not part of this PR. I am
not proposing to include optimizer changes here.
- The compile-time numbers below are local directional measurements only, not a
claim that continuation mode is inherently faster. The useful signal is that
this opt-in mode did not show a runtime regression on this call-heavy
benchmark, while the counter growth is visible and measurable.
Compile/link timing samples:
| Case | Samples (s) | Median (s) | Avg (s) |
|---|---:|---:|---:|
| Clang default source coverage | `0.61,0.61,0.61,0.61,0.62` | 0.61 | 0.61 |
| Clang call continuations | `0.44,0.44,0.44,0.44,0.44` | 0.44 | 0.44 |
| GCC/gcov | `0.41,0.40,0.40,0.40,0.40` | 0.40 | 0.40 |
| ICC `-prof-gen=srcpos` | `0.44,0.44,0.44,0.47,0.44` | 0.44 | 0.45 |
Runtime samples:
| Case | Runtime samples (s) | Median (s) | Binary size | Coverage report
generated |
|---|---:|---:|---:|---|
| Clang default source coverage | `0.48,0.48,0.48` | 0.48 | 171360 | yes |
| Clang call continuations | `0.48,0.47,0.48` | 0.48 | 178328 | yes |
| GCC/gcov | `0.47,0.47,0.47` | 0.47 | 115624 | yes |
| ICC `-prof-gen=srcpos` | `0.67,0.67,0.67` | 0.67 | 260960 | yes |
Interpretation: runtime overhead for this benchmark is effectively unchanged
versus default Clang source coverage. Binary size grew by about 7 KB in this
small benchmark because of the extra continuation counters. The counter growth
is real and is the main reason the mode should remain opt-in.
### Benchmark source used
<details>
<summary>function_call_bench.c</summary>
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif
volatile uint64_t global_sink;
#define LEAF_LIST(F)
\
F(0)
\
F(1)
\
F(2)
\
F(3)
\
F(4)
\
F(5)
\
F(6)
\
F(7)
\
F(8)
\
F(9)
\
F(10)
\
F(11)
\
F(12)
\
F(13)
\
F(14)
\
F(15)
\
F(16)
\
F(17)
\
F(18)
\
F(19)
\
F(20)
\
F(21)
\
F(22)
\
F(23)
\
F(24)
\
F(25)
\
F(26)
\
F(27)
\
F(28)
\
F(29)
\
F(30)
\
F(31)
\
F(32)
\
F(33)
\
F(34)
\
F(35)
\
F(36)
\
F(37)
\
F(38)
\
F(39)
\
F(40)
\
F(41)
\
F(42)
\
F(43)
\
F(44)
\
F(45)
\
F(46)
\
F(47)
\
F(48)
\
F(49)
\
F(50)
\
F(51)
\
F(52)
\
F(53)
\
F(54)
\
F(55)
\
F(56)
\
F(57)
\
F(58)
\
F(59)
\
F(60)
\
F(61)
\
F(62)
\
F(63)
\
F(64)
\
F(65)
\
F(66)
\
F(67)
\
F(68)
\
F(69)
\
F(70)
\
F(71)
\
F(72)
\
F(73)
\
F(74)
\
F(75)
\
F(76)
\
F(77)
\
F(78)
\
F(79)
\
F(80)
\
F(81)
\
F(82)
\
F(83)
\
F(84)
\
F(85)
\
F(86)
\
F(87)
\
F(88)
\
F(89)
\
F(90)
\
F(91)
\
F(92)
\
F(93)
\
F(94)
\
F(95)
\
F(96)
\
F(97)
\
F(98)
\
F(99)
\
F(100)
\
F(101)
\
F(102)
\
F(103)
\
F(104)
\
F(105)
\
F(106)
\
F(107)
\
F(108)
\
F(109)
\
F(110)
\
F(111)
\
F(112)
\
F(113)
\
F(114)
\
F(115)
\
F(116)
\
F(117)
\
F(118)
\
F(119)
\
F(120)
\
F(121)
\
F(122)
\
F(123)
\
F(124)
\
F(125)
\
F(126)
\
F(127)
#define ROTL64(X, R) (((X) << (R)) | ((X) >> (64 - (R))))
#define DEFINE_LEAF(N)
\
static NOINLINE uint64_t leaf_##N(uint64_t x) {
\
enum { Rot = ((N) % 17) + 7 };
\
x ^= (uint64_t)((N) + 1) * UINT64_C(0x9e3779b97f4a7c15);
\
x = ROTL64(x, Rot);
\
x += (x >> 23) ^ (x << 11);
\
return x ^ (uint64_t)((N) * 1315423911u);
\
}
LEAF_LIST(DEFINE_LEAF)
#define CALL_LEAF(N)
\
x ^= leaf_##N(x + (uint64_t)(N));
\
x += ROTL64(x, ((N) % 13) + 5);
static NOINLINE uint64_t run_all(uint64_t x) {
LEAF_LIST(CALL_LEAF)
return x;
}
static NOINLINE uint64_t run_branchy(uint64_t x, uint64_t i) {
if ((x ^ i) & 1)
x ^= leaf_3(x);
else
x ^= leaf_97(x);
if ((x + i) & 8)
x += leaf_41(x);
else
x += leaf_113(x);
return x;
}
int main(int argc, char **argv) {
uint64_t iterations = 120000;
if (argc > 1)
iterations = strtoull(argv[1], NULL, 0);
uint64_t x = UINT64_C(0x123456789abcdef0);
for (uint64_t i = 0; i < iterations; ++i) {
x = run_all(x + i);
if ((i & 1023) == 0)
x ^= run_branchy(x, i);
}
global_sink = x;
printf("%" PRIu64 "\n", x);
return 0;
}
```
</details>
### Known caveats / review risks
- This mode intentionally adds counters after returning calls, so it can
noticeably increase counter count in call-heavy code.
- Profiles generated with and without this flag should be treated as different
instrumentation modes and should not be mixed.
- The installed validation compiler in this environment needs `-ldl` when
linking Clang profile-runtime coverage binaries because `libclang_rt.profile.a`
references `dladdr/dlopen/dlsym/dlclose/dlerror`.
- I have not included the separate SLPVectorizer compile-time optimization in
this PR. That should remain separate if we decide to pursue it.
My current read: this is reasonable to keep as WIP/RFC-quality and ask for
design feedback. The functional behavior is useful and the default mode stays
untouched, but reviewers will likely focus on the counter-growth tradeoff and
whether this belongs as a supported opt-in coverage mode.
https://github.com/llvm/llvm-project/pull/201079
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
