tqchen commented on PR #589: URL: https://github.com/apache/tvm-ffi/pull/589#issuecomment-4446075606
## Detailed measurements

### Binary size: libtvm_ffi.so (stripped, Release / GCC 11.4 / ld.bfd 2.38 / x86_64 Linux)

Isolation study: 4 builds from identical source state, toggling the macros via local redefinitions to isolate which macro contributes what.

| Build                              | Stripped size (B) | Delta       |
|------------------------------------|------------------:|------------:|
| baseline (both macros no-op)       | 1,887,800         | n/a         |
| `TVM_FFI_COLD_CODE` only           | 1,834,568         | -53,232 B   |
| `TVM_FFI_PREDICT_FALSE/TRUE` only  | 1,908,280         | +20,480 B   |
| both (this PR)                     | 1,842,728         | -45,072 B   |

Cold attribute alone (`current − no_cold`): -65,528 B (-3.4%).
PREDICT macros alone (`current − no_predict`): +8,184 B (+0.4%).

The cold attribute does most of the work for size. PREDICT macros are layout-only and slightly grow the binary: branch-prediction-driven basic-block reordering occasionally inserts extra jumps or duplicates code in the rearranged layout. The benefit of PREDICT is layout (hot fall-through stays contiguous), not size.

### Where the cold-attribute savings come from

Assembly diff on `structural_equal.cc.o`:

- `ErrorBuilder` ctor body: 1,045 B → 484 B per instance (-561 B per TU × 18 TUs ≈ 10 KB at object level).
- `ErrorBuilder` dtor body: essentially unchanged (727 vs 717 bytes).
- Number of dtor call sites: identical in both builds (18 each).

The dominant mechanism is GCC's "optimized for size rather than speed" codegen on cold function bodies (`-Os`-style codegen in place of loop unrolling, vectorization, and branchless tricks), not inlining suppression as I had initially conjectured. The dtor was `[[noreturn]]` and already out-of-line, so cold-marking didn't change its call-site count.

Across all `tvm_ffi_objs` `.cc.o` files: cold adds ~100 KB to `.text.unlikely` and shrinks regular `.text` by ~180 KB, for a net ~52 KB pre-link savings (which matches the stripped-binary delta).

### Cold cluster bounds: libtvm_ffi.so

`.text` total: 1.45 MiB at `0x8430..0x16aee4`.
Cold cluster: `0x8430..~0x22700` = approximately **103 KiB at the head of `.text`** (7.3% of `.text`). Contents:

- `ErrorBuilder` ctors and dtor
- `TVMFFISegFaultHandler`, `TVMFFIInstallSignalHandler`
- Many compiler-emitted `.cold` / `.part.N` / `.isra.N` split bodies from hot functions whose callees became transitively cold
- libbacktrace symbol-related helpers that get split-cold

Public C ABI exports verified to remain in the hot region: `TVMFFIError*` family at `0x6b820+`, `TVMFFIBacktrace` at `0x47140`, `TVMFFIObjectIncRef`/`DecRef` deeper in.

### Cython extension `core.abi3.so` (stripped)

| Build         | Stripped (B) | `.text` (B)   | `.eh_frame` (B) |
|---------------|-------------:|--------------:|----------------:|
| baseline      | 788,200      | 602,875       | 39,628          |
| with markers  | 788,200      | 603,308       | 39,732          |
| Δ             | 0            | +433 (+0.07%) | +104            |

Stripped on-disk size is unchanged: ELF page-alignment padding absorbs the sub-page `.text` delta. `.text` grows by 433 B (eh_frame metadata for cold-marked function epilogues + `__builtin_expect` bookkeeping).

Cold cluster on the Cython side: ~2.7 KiB (`ForwardPyErrorToFFI` plus ~10 GCC auto-cold-split `.cold` thunks from large Pyx wrappers). Much smaller than `libtvm_ffi.so`'s 103 KiB because `core.cpp` is dominated by one giant `__pyx_pymod_exec_core` (~110 KiB); per-TU `.text.unlikely` content gets absorbed into the dominant function during link-time comdat merging.
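For readers less familiar with the two kinds of markers measured above, here is a minimal sketch of the general pattern. The `EXAMPLE_*` macros, `ThrowOutOfRange`, and `CheckedAt` are hypothetical names made up for illustration; they are not the definitions of `TVM_FFI_COLD_CODE` or `TVM_FFI_PREDICT_FALSE/TRUE` from this PR, which may differ. The sketch only assumes the documented GCC/Clang `cold` attribute and `__builtin_expect` builtin.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the PR's macros (assumption: the real
// definitions live in the tvm-ffi headers and may be written differently).
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_COLD_CODE __attribute__((cold))
#define EXAMPLE_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#define EXAMPLE_PREDICT_TRUE(x) (__builtin_expect(!!(x), 1))
#else
// Fallback: the markers become no-ops on other compilers.
#define EXAMPLE_COLD_CODE
#define EXAMPLE_PREDICT_FALSE(x) (x)
#define EXAMPLE_PREDICT_TRUE(x) (x)
#endif

// Error path: the cold attribute asks the compiler to optimize this body
// for size and to group it with other unlikely-executed code.
EXAMPLE_COLD_CODE void ThrowOutOfRange(std::int64_t index, std::int64_t size) {
  throw std::out_of_range("index " + std::to_string(index) +
                          " out of range for size " + std::to_string(size));
}

// Hot path: the bounds check is hinted as unlikely, so the normal return
// can stay on the fall-through path.
std::int64_t CheckedAt(const std::int64_t* data, std::int64_t size,
                       std::int64_t index) {
  if (EXAMPLE_PREDICT_FALSE(index < 0 || index >= size)) {
    ThrowOutOfRange(index, size);
  }
  return data[index];
}
```

Calling `CheckedAt(data, 3, 1)` on a three-element array returns the element at index 1, while an out-of-range index goes through the cold helper and throws `std::out_of_range`. Whether the helper actually lands in `.text.unlikely` depends on the compiler, target, and linker script, so treat the placement as something to verify with `objdump` or `nm` rather than a guarantee.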
### Performance: `benchmark_dlpack.py` (CPU-only subset, two trials each, median)

| Scenario                             | Baseline | With markers | Delta   |
|--------------------------------------|---------:|-------------:|--------:|
| `tvm_ffi.nop(tvm_tensor x3)`         | 112.8 ns | 112.2 ns     | -0.49%  |
| `tvm_ffi.nop.autodlpack(torch[cpu])` | 308.6 ns | 303.4 ns     | -1.69%  |
| `tvm_ffi.nop.autodlpack(numpy)`      | 939.5 ns | 926.6 ns     | -1.37%  |
| `tvm_ffi.nop+from_dlpack(torch)`     | 791.5 ns | 787.0 ns     | -0.56%  |
| `tvm_ffi.nop(int x3)`                | 133.4 ns | 133.3 ns     | -0.04%  |
| `tvm_ffi.nop()`                      | 90.4 ns  | 89.2 ns      | -1.33%  |
| `tvm.__dlpack__()`                   | 84.8 ns  | 84.4 ns      | -0.41%  |

All within ±2% run-to-run noise; no regression. The slight negative trend (faster) is within noise and not claimed as a real win.

### Notes on the investigation

- `-ffunction-sections` is not required for cold separation. An earlier draft of this work enabled the flag and saw an extra ~4 KB packing gain; verified that the cold cluster appears with the markers alone via the default linker script's `.text.unlikely.*` grouping rule. Dropped the flag flip from this PR to keep the change header-only.
- `--gc-sections` is intentionally not enabled. tvm-ffi has runtime registration patterns (`TVM_FFI_STATIC_INIT_BLOCK`) where symbols are referenced via runtime registries; aggressive section GC would need a separate audit of `__attribute__((used))` placement to avoid silently stripping registered globals.
- The C ABI audit revision (keeping `TVMFFIError*` and `TVMFFIBacktrace` hot) was applied after measuring an initial draft that cold-marked them. Cross-DSO callers expect public exports to be ordinary hot-tier entry points, and once an error path enters the TLS setter it should be fast, not size-optimized.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
