On Wed, Jul 1, 2026 at 8:40 AM Tomasz Kaminski <[email protected]> wrote:
>
> On Wed, Jul 1, 2026 at 8:36 AM Tomasz Kaminski <[email protected]>
> wrote:
>
>> On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:
>>
>>> I ran benchmarks comparing origin vs v3 (128B buffer) vs
>>> a prototype (template _NBuf, 256B for local_info).
>>>
>>> Full results below.
>>>
>>> Latency (ns/op, B-A-B-A interleaved)
>>> --------------------------------------------------
>>> Type                      origin    v3        improvement
>>> year_month_weekday_last   1025ns    368ns     -64.1%
>>> year_month                 567ns    243ns     -57.1%
>>> month_day                  557ns    259ns     -53.5%
>>> weekday_indexed            506ns    247ns     -51.2%
>>> year_month_day             227ns    158ns     -30.4%
>>> local_time                 435ns    316ns     -27.4%
>>> sys_time                   422ns    319ns     -24.4%
>>> sys_days                   221ns    170ns     -23.1%
>>> hh_mm_ss                   233ns    180ns     -22.7%
>>> weekday                    252ns    201ns     -20.2%
>>> day                        196ns    157ns     -19.9%
>>> zoned_time                 784ns    665ns     -15.2%
>>> sys_info                  1538ns   1484ns      -3.5%
>>> local_info                1525ns   1483ns      -2.8%
>>>
>>> Lower is better. All types show improvement; no regressions.
>>>
>>> Microarchitecture (perf stat, single run)
>>> --------------------------------------------------------------
>>> Type                      Insn(orig)  Insn(v3)  Insn-    Cyc(orig)  Cyc(v3)  Cyc-
>>> year_month_weekday_last   280.6B       85.2B    -69.6%   119.8B     35.2B    -70.6%
>>> month_day                 161.4B       63.7B    -60.5%    66.4B     27.5B    -58.6%
>>> year_month                162.2B       64.6B    -60.2%    67.7B     27.9B    -58.8%
>>> weekday_indexed           143.8B       66.0B    -54.1%    60.6B     28.0B    -53.8%
>>> hh_mm_ss                   71.4B       52.6B    -26.3%    28.2B     21.8B    -22.7%
>>> weekday                    75.8B       56.2B    -25.9%    30.9B     23.9B    -22.7%
>>> year_month_day             68.7B       51.6B    -24.9%    26.7B     20.2B    -24.3%
>>> local_time                120.1B       92.2B    -23.2%    48.2B     37.9B    -21.4%
>>> sys_time                  120.0B       92.5B    -22.9%    48.1B     37.7B    -21.6%
>>> day                        62.9B       49.9B    -20.7%    24.1B     19.6B    -18.7%
>>> sys_days                   68.7B       54.8B    -20.2%    26.4B     21.7B    -17.8%
>>> zoned_time                226.5B      195.2B    -13.8%    89.8B     78.8B    -12.2%
>>> sys_info                  120.8B      115.3B     -4.6%    45.7B     43.7B     -4.4%
>>> local_info                120.6B      115.3B     -4.4%    45.6B     44.5B     -2.4%
>>>
>>> Sorted by Insn- (largest reduction first).
>>> "-" = reduction (negative = fewer instructions/cycles = improvement).
>>> All values negative: no regression in any type.
>>>
>>> Insn(orig)/Insn(v3)   total instructions executed (less is better)
>>> Insn-                 instruction reduction (more negative = better)
>>> Cyc(orig)/Cyc(v3)     total CPU cycles (less is better)
>>> Cyc-                  cycle reduction (more negative = better)
>>>
>>> Observations:
>>> - Stringstream types (first 4): 50-70% improvement. Eliminating the
>>>   temporary stringstream and its repeated sentry constructions accounts
>>>   for the majority of the gain.
>>> - format/vformat types (next 8): 13-27% improvement. The gain comes
>>>   from eliminating the temporary std::string (heap allocation) and
>>>   format-string parsing, replacing it with a stack buffer.
>>> - sys_info and local_info (last 2): ~4% instruction reduction, small
>>>   but real. The dominant cost (~95%) is the internal formatter logic,
>>>   which is identical between origin and v3.
>>>
>>> sys_info and local_info: origin vs 128B vs 256B
>>> -----------------------------------------------
>>> B-A-B-A (20M iterations per run):
>>>
>>> sys_info:
>>>   B1 origin:  1604ns   A1 (256B buffer):  1528ns
>>>   B2 origin:  1935ns   A2 (256B buffer):  1491ns
>>>   Avg origin: 1770ns   Avg (256B buffer): 1510ns   improvement: -14.7%
>>>
>>> local_info:
>>>   B1 origin:  1599ns   A1 (256B buffer):  1529ns
>>>   B2 origin:  1934ns   A2 (256B buffer):  1514ns
>>>   Avg origin: 1766ns   Avg (256B buffer): 1522ns   improvement: -13.8%
>>>
>>> Origin varies by 300-400ns between runs (allocator state: SSO vs
>>> heap). The 256B buffer version stays stable within 40ns. The 256B
>>> buffer avoids the heap fallback for the nonexistent case (171B output).
>>> 128B works for the common path but falls back to std::format for this
>>> case.
>>>
>> That is a really promising result, so I would like you to pursue that
>> direction.
>>
>>>
>>> local_info output sizes:
>>>   unique case:      ~69B (fits in 128B)
>>>   nonexistent case: 171B (requires 256B to avoid heap fallback)
>>>
>>> Template _NBuf parameter
>>> ------------------------
>>> I suggest adding a non-type template parameter to allow per-type buffer
>>> tuning:
>>>
>>>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>>>
>> I would name the template parameter _BufSize.
>>
>>>            typename _Arg, typename... _OptLocale>
>>>   __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>>>                  const _Arg& __arg, const _OptLocale&... __loc);
>>>
>>> All types default to 128B. local_info uses 256B (only the nonexistent
>>> case exceeds 128). This makes the expected output length explicit at
>>> each call site and gives future types flexibility without touching the
>>> helper definition.
>>>
>> I like this approach. We could even go with reduced buffer sizes
>> depending on the type. This number is correlated with the _Arg template
>> argument, so it would not cause additional template instantiations.
>>
> I mean reduced to the nearest power of 2 needed.
> We will still need to use 128B for anything that includes the localized
> name of the month or weekday.
>
>> Could you please prepare the revision with the changes listed above?
>> Only for the second commit (I hope to land the test soon).
>>
>>>
>>> Test environment
>>> ----------------
>>> CPU:      2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz
>>>           turbo)
>>>           14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>>>           2x NUMA nodes
>>> Memory:   125 GiB
>>> OS:       Linux 5.15.0-126-generic (Ubuntu) x86_64
>>> Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
>>> glibc:    2.35
>>>
