tqchen commented on PR #589:
URL: https://github.com/apache/tvm-ffi/pull/589#issuecomment-4446075606

   ## Detailed measurements
   
   ### Binary size — libtvm_ffi.so (stripped, Release / GCC 11.4 / ld.bfd 2.38 / x86_64 Linux)
   
   Isolation study: 4 builds from identical source state, toggling the macros 
via local redefinitions to isolate which macro contributes what.
   
   | Build                              | Stripped size | Delta       |
   |------------------------------------|--------------:|------------:|
   | baseline (both macros no-op)       |     1,887,800 |          —  |
   | `TVM_FFI_COLD_CODE` only           |     1,834,568 |  -53,232 B  |
   | `TVM_FFI_PREDICT_FALSE/TRUE` only  |     1,908,280 |  +20,480 B  |
   | both (this PR)                     |     1,842,728 |  -45,072 B  |
   
   Marginal effect of the cold attribute with PREDICT enabled (`current − no_cold`): -65,528 B (-3.4%).
   Marginal effect of the PREDICT macros with cold enabled (`current − no_predict`): +8,184 B (+0.4%).
   
   The cold attribute does most of the work for size. PREDICT macros are 
layout-only and slightly grow the binary — branch-prediction-driven basic-block 
reordering occasionally inserts extra jumps or duplicates code in the 
rearranged layout. The benefit of PREDICT is layout (hot fall-through stays 
contiguous), not size.
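
   To make the layout effect concrete, here is a minimal sketch of branch-hint 
macros in the style of `TVM_FFI_PREDICT_FALSE/TRUE`. The `DEMO_*` names and 
`CheckedAt` are hypothetical and assume only the GCC/Clang `__builtin_expect` 
builtin; tvm-ffi's actual definitions may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-ins for branch-hint macros (not tvm-ffi's definitions).
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_PREDICT_FALSE(cond) (__builtin_expect(!!(cond), 0))
#define DEMO_PREDICT_TRUE(cond) (__builtin_expect(!!(cond), 1))
#else
#define DEMO_PREDICT_FALSE(cond) (cond)
#define DEMO_PREDICT_TRUE(cond) (cond)
#endif

// Bounds-checked load. The hint marks the failing check as unlikely, so the
// compiler is free to lay out the load and return as the fall-through path
// and place the throw elsewhere.
int CheckedAt(const int* data, std::size_t size, std::size_t index) {
  assert(data != nullptr);
  if (DEMO_PREDICT_FALSE(index >= size)) {
    throw std::out_of_range("index out of range");
  }
  return data[index];
}
```

   The hint only influences code placement, never the result: `CheckedAt` 
behaves identically when the macros are redefined as no-ops, which is how the 
isolation builds above were produced.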
   
   ### Where the cold-attribute savings come from
   
   Assembly diff on `structural_equal.cc.o`:
   
   - `ErrorBuilder` ctor body: 1,045 B → 484 B per instance, i.e. -561 B per 
TU; across 18 TUs that is ≈ 10 KB at object level.
   - `ErrorBuilder` dtor body: essentially unchanged (727 vs 717 bytes).
   - Number of dtor call sites: identical in both builds (18 each).
   
   The dominant mechanism is GCC's "optimized for size rather than speed" 
codegen on cold function bodies: `-Os`-style code generation forgoes loop 
unrolling, vectorization, and branchless tricks. It is not inlining 
suppression, as I had initially conjectured. The dtor was `[[noreturn]]` and 
already out-of-line, so cold-marking didn't change its call-site count.
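
   For readers who have not seen the pattern, the following is a simplified, 
hypothetical sketch of a cold-marked error builder whose `[[noreturn]]` 
destructor throws. `DemoErrorBuilder`, `DEMO_COLD_CODE`, and `ParsePositive` 
are illustrative names, and the sketch assumes the `[[gnu::cold]]` attribute 
spelling; it is not the actual tvm-ffi `ErrorBuilder`.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical cold marker (not tvm-ffi's definition).
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_COLD_CODE [[gnu::cold]]
#else
#define DEMO_COLD_CODE
#endif

// Collects a message through a stream and throws from its destructor at the
// end of the full expression that created the temporary.
class DemoErrorBuilder {
 public:
  DEMO_COLD_CODE explicit DemoErrorBuilder(std::string kind)
      : kind_(std::move(kind)) {
    assert(!kind_.empty());
  }

  // The destructor always throws, so it must be declared noexcept(false).
  [[noreturn]] DEMO_COLD_CODE ~DemoErrorBuilder() noexcept(false) {
    throw std::runtime_error(kind_ + ": " + stream_.str());
  }

  std::ostringstream& stream() { return stream_; }

 private:
  std::string kind_;
  std::ostringstream stream_;
};

// Hot function with a rarely taken error path.
int ParsePositive(int value) {
  if (value <= 0) {
    DemoErrorBuilder("ValueError").stream()
        << "expected a positive value, got " << value;
  }
  return value;
}
```

   Because both the ctor and the dtor are marked cold, a compiler that honors 
the attribute may optimize their bodies for size and place them away from the 
hot code, which is the effect measured above.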
   
   Across all `tvm_ffi_objs` `.cc.o` files: cold adds ~100 KB to 
`.text.unlikely` and shrinks regular `.text` by ~180 KB, for a net ~52 KB 
pre-link savings (which matches the stripped-binary delta).
   
   ### Cold cluster bounds — libtvm_ffi.so
   
   `.text` total: 1.45 MB (1,452,724 B, about 1.39 MiB) at `0x8430..0x16aee4`.
   
   Cold cluster: `0x8430..~0x22700` = approximately **103 KiB at the head of 
`.text`** (7.3% of `.text`). Contents:
   
   - ErrorBuilder ctors and dtor
   - `TVMFFISegFaultHandler`, `TVMFFIInstallSignalHandler`
   - Many compiler-emitted `.cold` / `.part.N` / `.isra.N` split bodies from 
hot functions whose callees became transitively cold
   - libbacktrace symbol-related helpers that get split-cold
   
   Public C ABI exports verified to remain in the hot region — `TVMFFIError*` 
family at `0x6b820+`, `TVMFFIBacktrace` at `0x47140`, 
`TVMFFIObjectIncRef`/`DecRef` deeper in.
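
   The cluster bounds can be cross-checked from the quoted addresses alone. The 
sketch below is just that arithmetic; the constant names are hypothetical, and 
because the cold cluster's upper bound is approximate, the result lands within 
a couple of KiB of the rounded figures quoted above.

```cpp
#include <cstdint>

// Addresses quoted above for libtvm_ffi.so.
constexpr std::uint64_t kTextBegin = 0x8430;
constexpr std::uint64_t kTextEnd = 0x16aee4;
constexpr std::uint64_t kColdEnd = 0x22700;  // approximate upper bound

// Sizes in bytes derived from the address ranges.
constexpr std::uint64_t kTextBytes = kTextEnd - kTextBegin;
constexpr std::uint64_t kColdBytes = kColdEnd - kTextBegin;

// Cold cluster size in whole KiB (integer division truncates).
constexpr std::uint64_t kColdKiB = kColdBytes / 1024;

// Cold share of .text in hundredths of a percent (basis points).
constexpr std::uint64_t kColdShareBasisPoints = kColdBytes * 10000 / kTextBytes;

static_assert(kTextBytes == 1452724, ".text spans 1,452,724 bytes");
static_assert(kColdBytes == 107216, "cold cluster spans about 107,216 bytes");
```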
   
   ### Cython extension `core.abi3.so` (stripped)
   
   | Build         | Stripped | `.text`     | `.eh_frame` |
   |---------------|---------:|------------:|------------:|
   | baseline      |  788,200 |    602,875  |     39,628  |
   | with markers  |  788,200 |    603,308  |     39,732  |
   | Δ             |        0 |   +433 (+0.07%) |    +104  |
   
   Stripped on-disk size is unchanged: ELF page-alignment padding absorbs the 
sub-page deltas. `.text` grows by 433 B and `.eh_frame` by 104 B (unwind 
metadata for cold-marked functions plus `__builtin_expect` bookkeeping).
   
   Cold cluster on the Cython side: ~2.7 KiB (`ForwardPyErrorToFFI` plus ~10 
GCC auto-cold-split `.cold` thunks from large Pyx wrappers). This is much 
smaller than `libtvm_ffi.so`'s 103 KiB because `core.cpp` is dominated by one 
giant `__pyx_pymod_exec_core` (~110 KiB); per-TU `.text.unlikely` content gets 
absorbed into the dominant function during link-time comdat merging.
   
   ### Performance — benchmark_dlpack.py (CPU-only subset, two trials each, median)
   
   | Scenario                            | Baseline | With markers | Delta   |
   |-------------------------------------|---------:|-------------:|--------:|
   | `tvm_ffi.nop(tvm_tensor x3)`        | 112.8 ns |    112.2 ns  | -0.49%  |
   | `tvm_ffi.nop.autodlpack(torch[cpu])`| 308.6 ns |    303.4 ns  | -1.69%  |
   | `tvm_ffi.nop.autodlpack(numpy)`     | 939.5 ns |    926.6 ns  | -1.37%  |
   | `tvm_ffi.nop+from_dlpack(torch)`    | 791.5 ns |    787.0 ns  | -0.56%  |
   | `tvm_ffi.nop(int x3)`               | 133.4 ns |    133.3 ns  | -0.04%  |
   | `tvm_ffi.nop()`                     |  90.4 ns |     89.2 ns  | -1.33%  |
   | `tvm.__dlpack__()`                  |  84.8 ns |     84.4 ns  | -0.41%  |
   
   All within ±2% run-to-run noise; no regression. The slight negative trend 
(faster) is within noise and not claimed as a real win.
   
   ### Notes on the investigation
   
   - `-ffunction-sections` is not required for cold separation. An earlier 
draft of this work enabled the flag and saw an extra ~4 KB packing gain, but I 
verified that the cold cluster appears with the markers alone, via the default 
linker script's `.text.unlikely.*` grouping rule. I dropped the flag flip from 
this PR to keep the change header-only.
   - `--gc-sections` is intentionally not enabled. tvm-ffi has runtime 
registration patterns (`TVM_FFI_STATIC_INIT_BLOCK`) where symbols are 
referenced via runtime registries; aggressive section GC would need a separate 
audit of `__attribute__((used))` placement to avoid silently stripping 
registered globals.
   - The C ABI audit revision (keeping `TVMFFIError*` and `TVMFFIBacktrace` 
hot) was applied after measuring an initial draft that cold-marked them. 
Cross-DSO callers expect public exports to be ordinary hot-tier entry points, 
and once an error path enters the TLS setter it should be fast — not 
size-optimized.
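
   To illustrate the `--gc-sections` concern from the notes above, here is a 
minimal sketch of a static-initializer registration pattern. `DemoRegistry`, 
`CallRegistered`, and the `demo.add_one` entry are hypothetical; tvm-ffi's 
real `TVM_FFI_STATIC_INIT_BLOCK` machinery is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical marker: asks GCC/Clang to emit the entity even if nothing
// refers to it by name.
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_USED __attribute__((used))
#else
#define DEMO_USED
#endif

// Function-local static avoids static-initialization-order problems.
std::map<std::string, std::function<int(int)>>& DemoRegistry() {
  static std::map<std::string, std::function<int(int)>> registry;
  return registry;
}

namespace {

int AddOne(int x) { return x + 1; }

// No code refers to this object by name; it exists only for the side effect
// of its initializer, which runs during static initialization.
DEMO_USED const bool kAddOneRegistered = [] {
  DemoRegistry()["demo.add_one"] = AddOne;
  return true;
}();

}  // namespace

// Looks up a function by its registered name and calls it.
int CallRegistered(const std::string& name, int arg) {
  auto it = DemoRegistry().find(name);
  assert(it != DemoRegistry().end());
  return it->second(arg);
}
```

   `AddOne` is reachable only through the runtime registry, so no static 
reference to it exists outside the initializer. Whether a given linker 
configuration retains such entries under section GC is what the audit 
mentioned above would need to establish.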
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

