tqchen opened a new pull request, #589: URL: https://github.com/apache/tvm-ffi/pull/589
## Summary

Adds three header-only macros (`TVM_FFI_COLD_CODE`, `TVM_FFI_PREDICT_FALSE`, `TVM_FFI_PREDICT_TRUE`) to `tvm/ffi/base_details.h` and applies them to a small audited set of error-only helpers across `libtvm_ffi.so` and the Cython extension. No CMake changes. Downstream consumers that include `tvm/ffi/base_details.h` get the macros automatically and can apply them to their own helpers (notably TVM).

## Why

A binary-layout audit of `libtvm_ffi.so` found that internal error helpers (`ErrorBuilder` ctors/dtor) live in the middle of `.text`, interleaved with hot C ABI dispatch and container code. They only run on error / setup / teardown paths, so keeping them out of the hot instruction stream improves icache locality without changing behavior.

## What

```cpp
// include/tvm/ffi/base_details.h
#if defined(__GNUC__) || defined(__clang__)
#define TVM_FFI_COLD_CODE [[gnu::cold]]
#else
#define TVM_FFI_COLD_CODE
#endif

#if defined(__GNUC__) || defined(__clang__)
#define TVM_FFI_PREDICT_FALSE(cond) (__builtin_expect(static_cast<bool>(cond), 0))
#define TVM_FFI_PREDICT_TRUE(cond) (__builtin_expect(static_cast<bool>(cond), 1))
#else
#define TVM_FFI_PREDICT_FALSE(cond) (cond)
#define TVM_FFI_PREDICT_TRUE(cond) (cond)
#endif
```

`TVM_FFI_COLD_CODE` is applied only to functions that run exclusively on error / segfault / process-startup paths, never on regular teardown:

- `details::ErrorBuilder` ctors and the `[[noreturn]]` destructor
- `TVMFFISegFaultHandler` (internal)
- `TVMFFIInstallSignalHandler` (startup-only)
- `TVMFFIPyCallManager::ForwardPyErrorToFFI` (Python error forwarding)

`TVMFFIPyCallbackClosure::Deleter` is intentionally NOT cold: deleters run on every callback destruction, which is normal-lifecycle frequency. C ABI exports stay hot per cross-DSO surface hygiene.
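The macro definitions and the cold-marking policy above can be illustrated with a minimal, standalone sketch. The macro bodies are copied from the description so the snippet compiles without the real header; `ThrowIndexError` and `CheckedGet` are hypothetical names invented for illustration and are not part of tvm-ffi. The idea shown is only the general pattern: an error-only helper is marked cold, and the hot accessor wraps its rarely-true check in `TVM_FFI_PREDICT_FALSE`.

```cpp
#include <stdexcept>
#include <string>

// Macro definitions copied from the PR description so this sketch is standalone.
#if defined(__GNUC__) || defined(__clang__)
#define TVM_FFI_COLD_CODE [[gnu::cold]]
#define TVM_FFI_PREDICT_FALSE(cond) (__builtin_expect(static_cast<bool>(cond), 0))
#define TVM_FFI_PREDICT_TRUE(cond) (__builtin_expect(static_cast<bool>(cond), 1))
#else
#define TVM_FFI_COLD_CODE
#define TVM_FFI_PREDICT_FALSE(cond) (cond)
#define TVM_FFI_PREDICT_TRUE(cond) (cond)
#endif

// Hypothetical error-only helper: it runs only when a bounds check fails,
// so it is a candidate for the cold attribute.
TVM_FFI_COLD_CODE [[noreturn]] void ThrowIndexError(int index, int size) {
  throw std::out_of_range("index " + std::to_string(index) +
                          " out of range for size " + std::to_string(size));
}

// Hypothetical hot-path accessor: the failure branch is hinted as unlikely,
// and the error construction lives in the cold helper above.
inline int CheckedGet(const int* data, int size, int index) {
  if (TVM_FFI_PREDICT_FALSE(index < 0 || index >= size)) {
    ThrowIndexError(index, size);
  }
  return data[index];
}
```

Behavior is the same with or without the hints; on compilers where the macros expand to nothing, the code is an ordinary bounds-checked read.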
The `TVMFFIError*` family, `TVMFFIBacktrace`, and `SafeCallContext` setter methods all remain in the hot region; callers and tools expect them as ordinary entry points, and once an error path enters them they should be fast (the TLS setter should not be size-optimized).

`TVM_FFI_PREDICT_FALSE` is applied to the central choke points for error checking: `TVM_FFI_CHECK_SAFE_CALL`, `TVM_FFI_CHECK`, `GlobalFunctionTable::Update`'s already-registered branch, and ~17 error-check branches inside the Python→FFI dispatchers in `tvm_ffi_python_helpers.h`. `TVM_FFI_PREDICT_TRUE` is used once, on the dispatch-map cache-hit branch (taken on every call but the first).

## Mechanism

GCC and Clang emit cold-marked functions into per-TU `.text.unlikely` sections. The default GNU linker script's `*(.text.unlikely .text.*_unlikely .text.unlikely.*)` rule gathers them into a contiguous slot inside `.text`. No `-ffunction-sections` flag is required; cold separation works with the default build. On MSVC the macros are no-ops and the code is byte-identical to before.

## Measured impact

Stripped `libtvm_ffi.so`, Release / GCC 11.4 / ld.bfd 2.38 / x86_64 Linux:

| Build                        | Stripped size (B) |     Delta |
|------------------------------|------------------:|----------:|
| baseline (both macros no-op) |         1,887,800 |       n/a |
| cold attribute only          |         1,834,568 | -53,232 B |
| predict macros only          |         1,908,280 | +20,480 B |
| both (this PR)               |         1,842,728 | -45,072 B |

The size win is dominated by `[[gnu::cold]]` triggering size-optimizing codegen on cold function bodies. Branch-prediction macros are layout-only: they grow the binary slightly (about 8 KB on top of the cold-only build, +20,480 B when applied alone) but improve hot-path basic-block contiguity.

Cold cluster on `libtvm_ffi.so`: about 103 KiB at the head of `.text` (~7.3% of `.text`), containing all the error helpers plus compiler-emitted `.cold` split-bodies clustered together.

Cython extension `core.abi3.so`: stripped size unchanged (page-alignment padding absorbs the +433 B `.text` delta).
Cold cluster includes `ForwardPyErrorToFFI` and ~10 auto-cold `.cold` thunks from large Pyx wrappers.

## Performance

`benchmark_dlpack.py` CPU-only subset, two trials each, median:

| benchmark                    | baseline | with markers |  delta |
|------------------------------|---------:|-------------:|-------:|
| `nop(tvm_tensor x3)`         | 112.8 ns |     112.2 ns | -0.49% |
| `nop.autodlpack(torch[cpu])` | 308.6 ns |     303.4 ns | -1.69% |
| `nop.autodlpack(numpy)`      | 939.5 ns |     926.6 ns | -1.37% |
| `nop+from_dlpack(torch)`     | 791.5 ns |     787.0 ns | -0.56% |
| `nop(int x3)`                | 133.4 ns |     133.3 ns | -0.04% |
| `nop()`                      |  90.4 ns |      89.2 ns | -1.33% |
| `__dlpack__()`               |  84.8 ns |      84.4 ns | -0.41% |

All deltas are within ±2% run-to-run noise. No regression.

## ABI / portability

No ABI changes. The macros are header-only and the only observable difference is per-function attribute hints to the compiler. On MSVC every macro is a no-op (byte-identical codegen). On GCC and Clang, the cold attribute lowers function-entry alignment and triggers `-Os`-style codegen on the marked body; the branch-prediction macros only reorder basic blocks within the function.

## Test plan

- [x] 355/355 active C++ tests pass.
- [x] Python smoke test: `import tvm_ffi; print(tvm_ffi.__version__)` succeeds.
- [x] `benchmark_dlpack.py` CPU subset shows no regression.
- [x] Pre-commit clean.
- [x] clang-tidy clean.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
