I ran benchmarks comparing origin vs v3 (128B buffer) vs
a prototype (template _NBuf, 256B for local_info).
Full results below.

Latency (ns/op, B-A-B-A interleaved)
--------------------------------------------------
Type                      origin      v3  improvement
year_month_weekday_last   1025ns   368ns       -64.1%
year_month                 567ns   243ns       -57.1%
month_day                  557ns   259ns       -53.5%
weekday_indexed            506ns   247ns       -51.2%
year_month_day             227ns   158ns       -30.4%
local_time                 435ns   316ns       -27.4%
sys_time                   422ns   319ns       -24.4%
sys_days                   221ns   170ns       -23.1%
hh_mm_ss                   233ns   180ns       -22.7%
weekday                    252ns   201ns       -20.2%
day                        196ns   157ns       -19.9%
zoned_time                 784ns   665ns       -15.2%
sys_info                  1538ns  1484ns        -3.5%
local_info                1525ns  1483ns        -2.8%

Lower is better. All types show improvement; no regressions.
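
For readers who want to reproduce the B-A-B-A scheme, here is a minimal
sketch of an interleaved timing loop. It is not the harness used for the
numbers above; the names ns_per_op, interleaved_ns_per_op and
improvement_pct are made up for illustration, and a real benchmark also
has to keep the optimizer from discarding the measured call.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical helper (not from the benchmark source): times `iters`
// calls of `fn` and returns the mean cost in nanoseconds per call.
// `fn` should have an observable side effect so it is not optimized away.
template<typename Fn>
double ns_per_op(Fn&& fn, std::size_t iters)
{
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i)
    fn();
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iters);
}

// Runs B, A, B, A, ... for `rounds` pairs and averages each side
// separately, so slow drift (frequency scaling, cache state) hits both
// variants about equally. `rounds` must be non-zero.
// Returns {mean ns/op of B, mean ns/op of A}.
template<typename FnB, typename FnA>
std::pair<double, double>
interleaved_ns_per_op(FnB&& b, FnA&& a, std::size_t rounds, std::size_t iters)
{
  double sum_b = 0.0, sum_a = 0.0;
  for (std::size_t r = 0; r < rounds; ++r)
    {
      sum_b += ns_per_op(b, iters);
      sum_a += ns_per_op(a, iters);
    }
  const double n = static_cast<double>(rounds);
  return { sum_b / n, sum_a / n };
}

// Relative change in percent; negative means `after` is faster.
inline double improvement_pct(double before, double after)
{ return (after - before) / before * 100.0; }
```

With this definition, improvement_pct(1025.0, 368.0) gives roughly -64.1,
which is how the "improvement" column in the table is computed.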

Microarchitecture (perf stat, single run)
--------------------------------------------------------------
Type                      Insn(orig)  Insn(v3)   Insn-  Cyc(orig)  Cyc(v3)    Cyc-
year_month_weekday_last       280.6B     85.2B  -69.6%     119.8B    35.2B  -70.6%
month_day                     161.4B     63.7B  -60.5%      66.4B    27.5B  -58.6%
year_month                    162.2B     64.6B  -60.2%      67.7B    27.9B  -58.8%
weekday_indexed               143.8B     66.0B  -54.1%      60.6B    28.0B  -53.8%
hh_mm_ss                       71.4B     52.6B  -26.3%      28.2B    21.8B  -22.7%
weekday                        75.8B     56.2B  -25.9%      30.9B    23.9B  -22.7%
year_month_day                 68.7B     51.6B  -24.9%      26.7B    20.2B  -24.3%
local_time                    120.1B     92.2B  -23.2%      48.2B    37.9B  -21.4%
sys_time                      120.0B     92.5B  -22.9%      48.1B    37.7B  -21.6%
day                            62.9B     49.9B  -20.7%      24.1B    19.6B  -18.7%
sys_days                       68.7B     54.8B  -20.2%      26.4B    21.7B  -17.8%
zoned_time                    226.5B    195.2B  -13.8%      89.8B    78.8B  -12.2%
sys_info                      120.8B    115.3B   -4.6%      45.7B    43.7B   -4.4%
local_info                    120.6B    115.3B   -4.4%      45.6B    44.5B   -2.4%

Sorted by Insn- (largest reduction first).
"-" = reduction (negative = fewer instructions/cycles = improvement).
All values negative: no regression in any type.

  Insn(orig)/Insn(v3)  total instructions executed, B = billions (less is better)
  Insn-                instruction reduction (more negative = better)
  Cyc(orig)/Cyc(v3)    total CPU cycles, B = billions (less is better)
  Cyc-                 cycle reduction (more negative = better)

Observations:

- Stringstream types (first 4): 50-70% improvement. Eliminating the
  temporary stringstream and its repeated sentry constructions accounts
  for the majority of the gain.
- format/vformat types (next 8): 15-30% lower latency and 14-26% fewer
  instructions. The gain comes from eliminating the temporary
  std::string (heap allocation) and format-string parsing, replacing it
  with a stack buffer.
- sys_info and local_info (last 2): ~4% instruction reduction, small
  but real. The dominant cost (~95%) is the internal formatter logic,
  which is identical between origin and v3.

sys_info and local_info: origin vs 128B vs 256B
-----------------------------------------------
B-A-B-A (20M iterations per run):

sys_info:
  B1 origin:  1604ns    A1 (256B buffer):  1528ns
  B2 origin:  1935ns    A2 (256B buffer):  1491ns
  Avg origin: 1770ns    Avg (256B buffer): 1510ns    improvement: -14.7%

local_info:
  B1 origin:  1599ns    A1 (256B buffer):  1529ns
  B2 origin:  1934ns    A2 (256B buffer):  1514ns
  Avg origin: 1766ns    Avg (256B buffer): 1522ns    improvement: -13.8%

Origin varies by about 330ns between runs (allocator state: SSO vs
heap). The 256B buffer version stays stable within 40ns. The 256B buffer
avoids the heap fallback for the nonexistent case (171B output). A 128B
buffer covers the common path but falls back to std::format for the
nonexistent case.

local_info output sizes:
  unique case:      ~69B (fits in 128B)
  nonexistent case: 171B (requires 256B to avoid heap fallback)

Template _NBuf parameter
------------------------
I suggest adding a non-type template parameter to allow per-type buffer
tuning:

  template<size_t _NBuf = 128, typename _CharT, typename _Traits,
           typename _Arg, typename... _OptLocale>
    __chrono_write(basic_ostream<_CharT, _Traits>& __os,
                   const _Arg& __arg, const _OptLocale&... __loc);

All types default to 128B. local_info uses 256B (only the nonexistent
case exceeds 128). This makes the expected output length explicit at
each call site and gives future types flexibility without touching the
helper definition.

Test environment
----------------
CPU:      2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
          14 cores/socket, 2 threads/core, 28 cores / 56 threads total
          2x NUMA nodes
Memory:   125 GiB
OS:       Linux 5.15.0-126-generic (Ubuntu) x86_64
Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
glibc:    2.35