On Wed, Jul 1, 2026 at 8:36 AM Tomasz Kaminski <[email protected]> wrote:
>
>
> On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:
>
>> I ran benchmarks comparing origin vs v3 (128B buffer) vs
>> a prototype (template _NBuf, 256B for local_info).
>>
>> Full results below.
>>
>> Latency (ns/op, B-A-B-A interleaved)
>> --------------------------------------------------
>> Type                      origin      v3  improvement
>> year_month_weekday_last   1025ns   368ns       -64.1%
>> year_month                 567ns   243ns       -57.1%
>> month_day                  557ns   259ns       -53.5%
>> weekday_indexed            506ns   247ns       -51.2%
>> year_month_day             227ns   158ns       -30.4%
>> local_time                 435ns   316ns       -27.4%
>> sys_time                   422ns   319ns       -24.4%
>> sys_days                   221ns   170ns       -23.1%
>> hh_mm_ss                   233ns   180ns       -22.7%
>> weekday                    252ns   201ns       -20.2%
>> day                        196ns   157ns       -19.9%
>> zoned_time                 784ns   665ns       -15.2%
>> sys_info                  1538ns  1484ns        -3.5%
>> local_info                1525ns  1483ns        -2.8%
>>
>> Lower is better. All types show improvement; no regressions.
>>
>> Microarchitecture (perf stat, single run)
>> --------------------------------------------------------------
>> Type                     Insn(orig) Insn(v3)  Insn-  Cyc(orig) Cyc(v3)   Cyc-
>> year_month_weekday_last     280.6B    85.2B  -69.6%    119.8B   35.2B  -70.6%
>> month_day                   161.4B    63.7B  -60.5%     66.4B   27.5B  -58.6%
>> year_month                  162.2B    64.6B  -60.2%     67.7B   27.9B  -58.8%
>> weekday_indexed             143.8B    66.0B  -54.1%     60.6B   28.0B  -53.8%
>> hh_mm_ss                     71.4B    52.6B  -26.3%     28.2B   21.8B  -22.7%
>> weekday                      75.8B    56.2B  -25.9%     30.9B   23.9B  -22.7%
>> year_month_day               68.7B    51.6B  -24.9%     26.7B   20.2B  -24.3%
>> local_time                  120.1B    92.2B  -23.2%     48.2B   37.9B  -21.4%
>> sys_time                    120.0B    92.5B  -22.9%     48.1B   37.7B  -21.6%
>> day                          62.9B    49.9B  -20.7%     24.1B   19.6B  -18.7%
>> sys_days                     68.7B    54.8B  -20.2%     26.4B   21.7B  -17.8%
>> zoned_time                  226.5B   195.2B  -13.8%     89.8B   78.8B  -12.2%
>> sys_info                    120.8B   115.3B   -4.6%     45.7B   43.7B   -4.4%
>> local_info                  120.6B   115.3B   -4.4%     45.6B   44.5B   -2.4%
>>
>> Sorted by Insn- (largest reduction first).
>> "-" = reduction (negative = fewer instructions/cycles = improvement).
>> All values negative: no regression in any type.
>>
>> Insn(orig)/Insn(v3)  total instructions executed (less is better)
>> Insn-                instruction reduction (more negative = better)
>> Cyc(orig)/Cyc(v3)    total CPU cycles (less is better)
>> Cyc-                 cycle reduction (more negative = better)
>>
>> Observations:
>> - Stringstream types (first 4): 50-70% improvement. Eliminating the
>>   temporary stringstream and its repeated sentry constructions accounts
>>   for the majority of the gain.
>> - format/vformat types (next 8): 13-27% improvement. The gain comes
>>   from eliminating the temporary std::string (heap allocation) and
>>   format-string parsing, replacing it with a stack buffer.
>> - sys_info and local_info (last 2): ~4% instruction reduction, small
>>   but real. The dominant cost (~95%) is the internal formatter logic,
>>   which is identical between origin and v3.
>>
>> sys_info and local_info: origin vs 128B vs 256B
>> -----------------------------------------------
>> B-A-B-A (20M iterations per run):
>>
>> sys_info:
>>   B1 origin:  1604ns    A1 (256B buffer):  1528ns
>>   B2 origin:  1935ns    A2 (256B buffer):  1491ns
>>   Avg origin: 1770ns    Avg (256B buffer): 1510ns    improvement: -14.7%
>>
>> local_info:
>>   B1 origin:  1599ns    A1 (256B buffer):  1529ns
>>   B2 origin:  1934ns    A2 (256B buffer):  1514ns
>>   Avg origin: 1766ns    Avg (256B buffer): 1522ns    improvement: -13.8%
>>
>> Origin varies by 300-400ns between runs (allocator state: SSO vs
>> heap). The 256B buffer version stays stable within 40ns. The 256B buffer
>> avoids the heap fallback for the nonexistent case (171B output).
>> 128B works for the common path but falls back to std::format for this
>> case.
>>
> That is a really promising result, so I would like you to pursue that
> direction.
>
>>
>> local_info output sizes:
>>   unique case:      ~69B  (fits in 128B)
>>   nonexistent case: 171B  (requires 256B to avoid heap fallback)
>>
>> Template _NBuf parameter
>> ------------------------
>> I suggest adding a non-type template parameter to allow per-type buffer
>> tuning:
>>
>>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>>
> I would name the template parameter _BufSize
>
>>            typename _Arg, typename... _OptLocale>
>>   __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>>                  const _Arg& __arg, const _OptLocale&... __loc);
>>
>> All types default to 128B. local_info uses 256B (only the nonexistent
>> case exceeds 128). This makes the expected output length explicit at
>> each call site and gives future types flexibility without touching the
>> helper definition.
>>
> I like this approach. We could even go with reduced buffer sizes depending
> on the type. This number is correlated with the _Arg template argument, so
> it would not cause additional template instantiations.
> I mean reduced to the nearest power of 2 needed.
>
> Could you please prepare the revision with the changes listed above? Only
> for the second commit (I hope to land the test soon).
>
>>
>> Test environment
>> ----------------
>> CPU:      2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
>>           14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>>           2x NUMA nodes
>> Memory:   125 GiB
>> OS:       Linux 5.15.0-126-generic (Ubuntu) x86_64
>> Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
>> glibc:    2.35
>>
>>
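
On the per-type buffer sizes rounded to a power of 2: a minimal sketch of
how the size could be derived from the argument type. Every name here
(round_up_pow2, max_output_len, buffer_size, the *_tag types) is
hypothetical and not taken from the patch. The only numbers used are the
171B and ~69B local_info output sizes quoted above and the 128B default.

```cpp
#include <cstddef>

// Smallest power of two that is >= n (for n >= 1).
constexpr std::size_t
round_up_pow2(std::size_t n)
{
  std::size_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

// Hypothetical stand-ins for the chrono types, so that the sketch does not
// depend on <chrono>.
struct local_info_tag { };
struct other_tag { };

// Hypothetical trait: largest expected output length per type, in bytes.
// The primary template uses the 128B default from the proposal.
template<typename T>
struct max_output_len
{ static constexpr std::size_t value = 128; };

// 171B is the nonexistent-case local_info output size from the benchmark.
template<>
struct max_output_len<local_info_tag>
{ static constexpr std::size_t value = 171; };

// The buffer size is a pure function of the type, so each type still maps
// to exactly one buffer size.
template<typename T>
constexpr std::size_t
buffer_size()
{ return round_up_pow2(max_output_len<T>::value); }

static_assert(buffer_size<local_info_tag>() == 256, "171B needs 256B");
static_assert(buffer_size<other_tag>() == 128, "default stays at 128B");
static_assert(round_up_pow2(69) == 128, "unique case fits in 128B");
```

Because the size depends only on the type, passing it as a template
argument alongside the argument type adds no instantiations beyond one per
type, which is the point made in the quoted reply.
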
