Re: [PATCH v3 0/2] libstdc++: Optimize chrono ostream insertion via __chrono_write

Tomasz Kaminski Tue, 30 Jun 2026 23:41:07 -0700

On Wed, Jul 1, 2026 at 8:36 AM Tomasz Kaminski <[email protected]> wrote:


>
>
> On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:
>
>> I ran benchmarks comparing origin vs v3 (128B buffer) vs
>> a prototype (template _NBuf, 256B for local_info).
>>
>> Full results below.
>>
>> Latency (ns/op, B-A-B-A interleaved)
>> --------------------------------------------------
>>   Type                       origin    v3        improvement
>>   year_month_weekday_last     1025ns     368ns      -64.1%
>>   year_month                   567ns     243ns      -57.1%
>>   month_day                    557ns     259ns      -53.5%
>>   weekday_indexed              506ns     247ns      -51.2%
>>   year_month_day               227ns     158ns      -30.4%
>>   local_time                   435ns     316ns      -27.4%
>>   sys_time                     422ns     319ns      -24.4%
>>   sys_days                     221ns     170ns      -23.1%
>>   hh_mm_ss                     233ns     180ns      -22.7%
>>   weekday                      252ns     201ns      -20.2%
>>   day                          196ns     157ns      -19.9%
>>   zoned_time                   784ns     665ns      -15.2%
>>   sys_info                    1538ns    1484ns       -3.5%
>>   local_info                  1525ns    1483ns       -2.8%
>>
>>   Lower is better.  All types show improvement; no regressions.
>>
>> Microarchitecture (perf stat, single run)
>> --------------------------------------------------------------
>>   Type                       Insn(orig)  Insn(v3)   Insn-    Cyc(orig)  Cyc(v3)   Cyc-
>>   year_month_weekday_last       280.6B     85.2B   -69.6%     119.8B     35.2B   -70.6%
>>   month_day                     161.4B     63.7B   -60.5%      66.4B     27.5B   -58.6%
>>   year_month                    162.2B     64.6B   -60.2%      67.7B     27.9B   -58.8%
>>   weekday_indexed               143.8B     66.0B   -54.1%      60.6B     28.0B   -53.8%
>>   hh_mm_ss                       71.4B     52.6B   -26.3%      28.2B     21.8B   -22.7%
>>   weekday                        75.8B     56.2B   -25.9%      30.9B     23.9B   -22.7%
>>   year_month_day                 68.7B     51.6B   -24.9%      26.7B     20.2B   -24.3%
>>   local_time                    120.1B     92.2B   -23.2%      48.2B     37.9B   -21.4%
>>   sys_time                      120.0B     92.5B   -22.9%      48.1B     37.7B   -21.6%
>>   day                            62.9B     49.9B   -20.7%      24.1B     19.6B   -18.7%
>>   sys_days                       68.7B     54.8B   -20.2%      26.4B     21.7B   -17.8%
>>   zoned_time                    226.5B    195.2B   -13.8%      89.8B     78.8B   -12.2%
>>   sys_info                      120.8B    115.3B    -4.6%      45.7B     43.7B    -4.4%
>>   local_info                    120.6B    115.3B    -4.4%      45.6B     44.5B    -2.4%
>>
>>   Sorted by Insn- (largest reduction first).
>>   "-" = reduction (negative = fewer instructions/cycles = improvement).
>>   All values negative: no regression in any type.
>>
>>   Insn(orig)/Insn(v3)  total instructions executed (less is better)
>>   Insn-                instruction reduction (more negative = better)
>>   Cyc(orig)/Cyc(v3)    total CPU cycles (less is better)
>>   Cyc-                 cycle reduction (more negative = better)
>>
>> Observations:
>> - Stringstream types (first 4): 50-70% improvement.  Eliminating the
>>   temporary stringstream and its repeated sentry constructions accounts
>>   for the majority of the gain.
>> - format/vformat types (next 8): 13-27% improvement.  The gain comes
>>   from eliminating the temporary std::string (heap allocation) and
>>   format-string parsing, replacing it with a stack buffer.
>> - sys_info and local_info (last 2): ~4% instruction reduction, small
>>   but real.  The dominant cost (~95%) is the internal formatter logic,
>>   which is identical between origin and v3.
>>
>> sys_info and local_info: origin vs 128B vs 256B
>> -----------------------------------------------
>>   B-A-B-A (20M iterations per run):
>>
>>   sys_info:
>>     B1 origin: 1604ns    A1 (256B buffer): 1528ns
>>     B2 origin: 1935ns    A2 (256B buffer): 1491ns
>>     Avg origin: 1770ns   Avg (256B buffer): 1510ns   improvement: -14.7%
>>
>>   local_info:
>>     B1 origin: 1599ns    A1 (256B buffer): 1529ns
>>     B2 origin: 1934ns    A2 (256B buffer): 1514ns
>>     Avg origin: 1766ns   Avg (256B buffer): 1522ns   improvement: -13.8%
>>
>>   Origin varies by 300-400ns between runs (allocator state: SSO vs
>>   heap).  256B buffer version stays stable within 40ns.  The 256B buffer
>>   avoids the heap fallback for the nonexistent case (171B output).
>>   128B works for the common path but falls back to std::format for this.
>>
> That is a really promising result, so I would like you to pursue that
> direction.
>
>>
>>   local_info output sizes:
>>     unique case:       ~69B  (fits in 128B)
>>     nonexistent case:  171B  (requires 256B to avoid heap fallback)
>>
>> Template _NBuf parameter
>> ------------------------
>> I suggest adding a non-type template parameter to allow per-type buffer
>> tuning:
>>
>>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>>
> I would name the template parameter _BufSize
>
>>            typename _Arg, typename... _OptLocale>
>>     __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>>                    const _Arg& __arg, const _OptLocale&... __loc);
>>
>> All types default to 128B.  local_info uses 256B (only the nonexistent
>> case exceeds 128).  This makes the expected output length explicit at
>> each call site and gives future types flexibility without touching the
>> helper definition.
>>
> I like this approach. We could even go with reduced buffer sizes depending
> on the type. This number is correlated with the _Arg template argument, so
> it would not cause additional template instantiations.
>
I mean reduced to the nearest power of 2 that is needed.

>
> Could you please prepare the revision with the changes listed above? Only
> for the second commit (I hope to land the test soon).
>
>>
>> Test environment
>> ----------------
>>   CPU:     2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
>>            14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>>            2x NUMA nodes
>>   Memory:  125 GiB
>>   OS:      Linux 5.15.0-126-generic (Ubuntu) x86_64
>>   Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
>>   glibc:   2.35
>>
>>
