On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:

> I ran benchmarks comparing origin vs v3 (128B buffer) vs
> a prototype (template _NBuf, 256B for local_info).
>
> Full results below.
>
> Latency (ns/op, B-A-B-A interleaved)
> --------------------------------------------------
> Type                      origin      v3   improvement
> year_month_weekday_last   1025ns   368ns        -64.1%
> year_month                 567ns   243ns        -57.1%
> month_day                  557ns   259ns        -53.5%
> weekday_indexed            506ns   247ns        -51.2%
> year_month_day             227ns   158ns        -30.4%
> local_time                 435ns   316ns        -27.4%
> sys_time                   422ns   319ns        -24.4%
> sys_days                   221ns   170ns        -23.1%
> hh_mm_ss                   233ns   180ns        -22.7%
> weekday                    252ns   201ns        -20.2%
> day                        196ns   157ns        -19.9%
> zoned_time                 784ns   665ns        -15.2%
> sys_info                  1538ns  1484ns         -3.5%
> local_info                1525ns  1483ns         -2.8%
>
> Lower is better. All types show improvement; no regressions.
>
> Microarchitecture (perf stat, single run)
> --------------------------------------------------------------
> Type                     Insn(orig)  Insn(v3)   Insn-  Cyc(orig)  Cyc(v3)    Cyc-
> year_month_weekday_last      280.6B     85.2B  -69.6%     119.8B    35.2B  -70.6%
> month_day                    161.4B     63.7B  -60.5%      66.4B    27.5B  -58.6%
> year_month                   162.2B     64.6B  -60.2%      67.7B    27.9B  -58.8%
> weekday_indexed              143.8B     66.0B  -54.1%      60.6B    28.0B  -53.8%
> hh_mm_ss                      71.4B     52.6B  -26.3%      28.2B    21.8B  -22.7%
> weekday                       75.8B     56.2B  -25.9%      30.9B    23.9B  -22.7%
> year_month_day                68.7B     51.6B  -24.9%      26.7B    20.2B  -24.3%
> local_time                   120.1B     92.2B  -23.2%      48.2B    37.9B  -21.4%
> sys_time                     120.0B     92.5B  -22.9%      48.1B    37.7B  -21.6%
> day                           62.9B     49.9B  -20.7%      24.1B    19.6B  -18.7%
> sys_days                      68.7B     54.8B  -20.2%      26.4B    21.7B  -17.8%
> zoned_time                   226.5B    195.2B  -13.8%      89.8B    78.8B  -12.2%
> sys_info                     120.8B    115.3B   -4.6%      45.7B    43.7B   -4.4%
> local_info                   120.6B    115.3B   -4.4%      45.6B    44.5B   -2.4%
>
> Sorted by Insn- (largest reduction first).
> "-" = reduction (negative = fewer instructions/cycles = improvement).
> All values negative: no regression in any type.
>
> Insn(orig)/Insn(v3)   total instructions executed (less is better)
> Insn-                 instruction reduction (more negative = better)
> Cyc(orig)/Cyc(v3)     total CPU cycles (less is better)
> Cyc-                  cycle reduction (more negative = better)
>
> Observations:
> - Stringstream types (first 4): 50-70% improvement. Eliminating the
>   temporary stringstream and its repeated sentry constructions accounts
>   for the majority of the gain.
> - format/vformat types (next 8): 13-27% improvement. The gain comes
>   from eliminating the temporary std::string (heap allocation) and
>   format-string parsing, replacing it with a stack buffer.
> - sys_info and local_info (last 2): ~4% instruction reduction, small
>   but real. The dominant cost (~95%) is the internal formatter logic,
>   which is identical between origin and v3.
>
> sys_info and local_info: origin vs 128B vs 256B
> -----------------------------------------------
> B-A-B-A (20M iterations per run):
>
> sys_info:
>   B1 origin:  1604ns    A1 (256B buffer):  1528ns
>   B2 origin:  1935ns    A2 (256B buffer):  1491ns
>   Avg origin: 1770ns    Avg (256B buffer): 1510ns    improvement: -14.7%
>
> local_info:
>   B1 origin:  1599ns    A1 (256B buffer):  1529ns
>   B2 origin:  1934ns    A2 (256B buffer):  1514ns
>   Avg origin: 1766ns    Avg (256B buffer): 1522ns    improvement: -13.8%
>
> Origin varies by 300-400ns between runs (allocator state: SSO vs
> heap). 256B buffer version stays stable within 40ns. The 256B buffer
> avoids the heap fallback for the nonexistent case (171B output).
> 128B works for the common path but falls back to std::format for this.
>

That is a really promising result, so I would like you to pursue that
direction.

>
> local_info output sizes:
>   unique case:      ~69B  (fits in 128B)
>   nonexistent case: 171B  (requires 256B to avoid heap fallback)
>
> Template _NBuf parameter
> ------------------------
> I suggest to add a non-type template parameter to allow per-type buffer
> tuning:
>
>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>

I would name the template parameter _BufSize.

>            typename _Arg, typename... _OptLocale>
>   __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>                  const _Arg& __arg, const _OptLocale&... __loc);
>
> All types default to 128B. local_info uses 256B (only the nonexistent
> case exceeds 128). This makes the expected output length explicit at
> each call site and gives future types flexibility without touching the
> helper definition.
>

I like this approach. We could even go with reduced buffer sizes
depending on the type. The buffer size is determined by the _Arg template
argument, so it would not cause additional template instantiations.

Could you please prepare a revision with the changes listed above?
Only for the second commit (I hope to land the test soon).

>
> Test environment
> ----------------
> CPU:      2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
>           14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>           2x NUMA nodes
> Memory:   125 GiB
> OS:       Linux 5.15.0-126-generic (Ubuntu) x86_64
> Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
> glibc:    2.35
