Re: [PATCH v3 0/2] libstdc++: Optimize chrono ostream insertion via __chrono_write

Tomasz Kaminski Tue, 30 Jun 2026 23:36:31 -0700

On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:

> I ran benchmarks comparing origin vs v3 (128B buffer) vs
> a prototype (template _NBuf, 256B for local_info).
>
> Full results below.
>
> Latency (ns/op, B-A-B-A interleaved)
> --------------------------------------------------
>   Type                       origin    v3        improvement
>   year_month_weekday_last     1025ns     368ns      -64.1%
>   year_month                   567ns     243ns      -57.1%
>   month_day                    557ns     259ns      -53.5%
>   weekday_indexed              506ns     247ns      -51.2%
>   year_month_day               227ns     158ns      -30.4%
>   local_time                   435ns     316ns      -27.4%
>   sys_time                     422ns     319ns      -24.4%
>   sys_days                     221ns     170ns      -23.1%
>   hh_mm_ss                     233ns     180ns      -22.7%
>   weekday                      252ns     201ns      -20.2%
>   day                          196ns     157ns      -19.9%
>   zoned_time                   784ns     665ns      -15.2%
>   sys_info                    1538ns    1484ns       -3.5%
>   local_info                  1525ns    1483ns       -2.8%
>
>   Lower is better.  All types show improvement; no regressions.
>
> Microarchitecture (perf stat, single run)
> --------------------------------------------------------------
>   Type                     Insn(orig)  Insn(v3)    Insn-  Cyc(orig)  Cyc(v3)    Cyc-
>   year_month_weekday_last      280.6B     85.2B   -69.6%     119.8B    35.2B  -70.6%
>   month_day                    161.4B     63.7B   -60.5%      66.4B    27.5B  -58.6%
>   year_month                   162.2B     64.6B   -60.2%      67.7B    27.9B  -58.8%
>   weekday_indexed              143.8B     66.0B   -54.1%      60.6B    28.0B  -53.8%
>   hh_mm_ss                      71.4B     52.6B   -26.3%      28.2B    21.8B  -22.7%
>   weekday                       75.8B     56.2B   -25.9%      30.9B    23.9B  -22.7%
>   year_month_day                68.7B     51.6B   -24.9%      26.7B    20.2B  -24.3%
>   local_time                   120.1B     92.2B   -23.2%      48.2B    37.9B  -21.4%
>   sys_time                     120.0B     92.5B   -22.9%      48.1B    37.7B  -21.6%
>   day                           62.9B     49.9B   -20.7%      24.1B    19.6B  -18.7%
>   sys_days                      68.7B     54.8B   -20.2%      26.4B    21.7B  -17.8%
>   zoned_time                   226.5B    195.2B   -13.8%      89.8B    78.8B  -12.2%
>   sys_info                     120.8B    115.3B    -4.6%      45.7B    43.7B   -4.4%
>   local_info                   120.6B    115.3B    -4.4%      45.6B    44.5B   -2.4%
>
>   Sorted by Insn- (largest reduction first).
>   "-" = reduction (negative = fewer instructions/cycles = improvement).
>   All values negative: no regression in any type.
>
>   Insn(orig)/Insn(v3)  total instructions executed (less is better)
>   Insn-                instruction reduction (more negative = better)
>   Cyc(orig)/Cyc(v3)    total CPU cycles (less is better)
>   Cyc-                 cycle reduction (more negative = better)
>
> Observations:
> - Stringstream types (first 4): 50-70% improvement.  Eliminating the
>   temporary stringstream and its repeated sentry constructions accounts
>   for the majority of the gain.
> - format/vformat types (next 8): 13-27% improvement.  The gain comes
>   from eliminating the temporary std::string (heap allocation) and
>   format-string parsing, replacing it with a stack buffer.
> - sys_info and local_info (last 2): ~4% instruction reduction, small
>   but real.  The dominant cost (~95%) is the internal formatter logic,
>   which is identical between origin and v3.
>
> sys_info and local_info: origin vs 128B vs 256B
> -----------------------------------------------
>   B-A-B-A (20M iterations per run):
>
>   sys_info:
>     B1 origin: 1604ns    A1 (256B buffer): 1528ns
>     B2 origin: 1935ns    A2 (256B buffer): 1491ns
>     Avg origin: 1770ns   Avg (256B buffer): 1510ns   improvement: -14.7%
>
>   local_info:
>     B1 origin: 1599ns    A1 (256B buffer): 1529ns
>     B2 origin: 1934ns    A2 (256B buffer): 1514ns
>     Avg origin: 1766ns   Avg (256B buffer): 1522ns   improvement: -13.8%
>
>   Origin varies by 300-400ns between runs (allocator state: SSO vs
>   heap).  256B buffer version stays stable within 40ns.  The 256B buffer
>   avoids the heap fallback for the nonexistent case (171B output).
>   128B works for the common path but falls back to std::format for this.
>
That is a really promising result, so I would like you to pursue that direction.


>
>   local_info output sizes:
>     unique case:       ~69B  (fits in 128B)
>     nonexistent case:  171B  (requires 256B to avoid heap fallback)
>
> Template _NBuf parameter
> ------------------------
> I suggest adding a non-type template parameter to allow per-type buffer
> tuning:
>
>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>
I would name the template parameter _BufSize

>            typename _Arg, typename... _OptLocale>
>     __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>                    const _Arg& __arg, const _OptLocale&... __loc);
>
> All types default to 128B.  local_info uses 256B (only the nonexistent
> case exceeds 128).  This makes the expected output length explicit at
> each call site and gives future types flexibility without touching the
> helper definition.
>
I like this approach. We could even go with reduced buffer sizes depending
on the type. This number is correlated with the _Arg template argument, so
it would not cause additional template instantiations.
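
A rough sketch of what I mean by the size being tied to the type. All
names here (buf_size_for, write_stub, insert, and the two stand-in
structs) are hypothetical, and the sizes 32 and 256 are illustrative
only, not measured requirements:

```cpp
#include <cstddef>

// Hypothetical stand-ins for chrono types, so the sketch is self-contained.
struct day_like {};
struct local_info_like {};
struct other_like {};

// Default buffer size, with per-type overrides (illustrative values).
template<typename T>
inline constexpr std::size_t buf_size_for = 128;

template<>
inline constexpr std::size_t buf_size_for<day_like> = 32;

template<>
inline constexpr std::size_t buf_size_for<local_info_like> = 256;

// Stand-in for the real helper: only reports which buffer size it got.
template<std::size_t BufSize, typename Arg>
std::size_t write_stub(const Arg&)
{
    return BufSize;
}

// Each Arg maps to exactly one BufSize, so every type still produces a
// single write_stub instantiation, the same count as with a fixed size.
template<typename Arg>
std::size_t insert(const Arg& arg)
{
    return write_stub<buf_size_for<Arg>>(arg);
}

static_assert(buf_size_for<other_like> == 128);
static_assert(buf_size_for<local_info_like> == 256);
```

Whether the size is passed explicitly at each call site, as in your
prototype, or looked up from the type like this is a separate choice;
the instantiation count is the same either way.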

Could you please prepare a revision with the changes listed above? Only
for the second commit (I hope to land the test soon).

>
> Test environment
> ----------------
>   CPU:     2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
>            14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>            2x NUMA nodes
>   Memory:  125 GiB
>   OS:      Linux 5.15.0-126-generic (Ubuntu) x86_64
>   Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
>   glibc:   2.35
>
>
