Hi Willy.

On 2026-03-16 (Mon) 09:51, Willy Tarreau wrote:
Hi Aleks,

On Sun, Mar 15, 2026 at 03:12:47PM +0100, Aleksandar Lazic wrote:
Hi,


[snipp]

Patches
=======

The series is split into small steps:

1. use chunk builders for generated response headers
2. report the requested wait time in generated headers
3. increase the size of prebuilt response buffers
4. add a helper to fill HTX data in batches
5. switch the response path to the batched fill helper

Comments welcome, especially on whether this looks like a reasonable direction
for `haterm`.

Thanks for your work and your measurements.

This morning I had a look at your patch series and gave it a try on
our local lab (ARM and AMD). I'm seeing mixed results.

A few things in random order:
   - it's great that you got rid of that nasty snprintf(), I did the same
     on httpterm last year and gained a double-digit percentage in request
     rate. However this will not be measurable with 256k responses, since
     the overhead of such a call is negligible compared to sending 256k.
     But that was on my radar as something to get rid of, so I'm grateful
     that you did it.

You're welcome :-)
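If it helps, the chunk-builder idea from patch 1 boils down to something like
this; all names here (struct hdrchunk, hdr_add, hdr_add_uint) are invented for
illustration and are not the actual haterm code:

```c
#include <string.h>

/* a tiny append-only buffer with a write position */
struct hdrchunk {
	char  *area;   /* storage */
	size_t size;   /* capacity */
	size_t len;    /* bytes used so far */
};

/* append <n> raw bytes; returns 0 on overflow, 1 on success */
static int hdr_add(struct hdrchunk *c, const char *s, size_t n)
{
	if (c->len + n > c->size)
		return 0;
	memcpy(c->area + c->len, s, n);
	c->len += n;
	return 1;
}

/* append an unsigned integer in decimal, without snprintf() */
static int hdr_add_uint(struct hdrchunk *c, unsigned long v)
{
	char tmp[24];
	char *p = tmp + sizeof(tmp);

	do {
		*--p = (char)('0' + v % 10);
		v /= 10;
	} while (v);
	return hdr_add(c, p, (size_t)(tmp + sizeof(tmp) - p));
}
```

The point is simply to keep a running write position instead of re-parsing a
format string for every header field.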

   - the time measurement is not correct actually, it reports the requested
     time while the purpose was to indicate the generation time. It's useful
     when you don't know if you're measuring haterm's internal latency or
     network latency. I've used this a lot with httpterm in the past, where
     latencies of several milliseconds could happen on a saturated machine,
     and seeing the server denounce itself as the culprit was definitely
     helpful!

Thanks for the explanation; it was not clear to me which "time" was meant here.
So should the "time" here be the "/?t=<time>" value, or something else?

```
# rg time src/haterm.c
44:        " - /?t=<time>        wait <time> milliseconds before responding.\n"
552:    /* XXX TODO time?  XXX */
553:    snprintf(hdrbuf, sizeof(hdrbuf), "time=%ld ms", 0L);
559:    /* XXX TODO time? XXX */
560: snprintf(hdrbuf, sizeof(hdrbuf), "id=%s, code=%d, cache=%d,%s size=%lld, time=%d ms (%ld real)",
603:     * /?{s=<size>|r=<resp>|t=<time>|c=<cache>}[&{...}]
749:    if (tick_isset(hs->res_time) && !tick_is_expired(hs->res_time, now_ms)) {
803:                    hs->res_time = tick_add(now_ms, hs->res_wait);
804:                    task_schedule(t, hs->res_time);
834:            if (tick_isset(hs->res_wait) && !tick_isset(hs->res_time)) {
836:                    hs->res_time = tick_add(now_ms, hs->res_wait);
837:                    task_schedule(t, hs->res_time);
934:    hs->res_time = TICK_ETERNITY;
996:     * but the haproxy muxes do not support this. At this time
```
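If the intent is the generation time, a minimal sketch could look like this,
assuming CLOCK_MONOTONIC is acceptable and using hypothetical names
(elapsed_ms, format_time_hdr) rather than haterm's actual ones:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* milliseconds elapsed between two monotonic timestamps */
static long elapsed_ms(struct timespec start, struct timespec end)
{
	return (end.tv_sec - start.tv_sec) * 1000L +
	       (end.tv_nsec - start.tv_nsec) / 1000000L;
}

/* format the measured (not the requested) time into the header buffer */
static long format_time_hdr(char *hdrbuf, size_t sz,
                            struct timespec start, struct timespec end)
{
	long ms = elapsed_ms(start, end);

	snprintf(hdrbuf, sz, "time=%ld ms", ms);
	return ms;
}

/* usage: take one timestamp when the request is parsed and another when
 * the response is ready, e.g.:
 *   clock_gettime(CLOCK_MONOTONIC, &start);
 *   ... generate the response ...
 *   clock_gettime(CLOCK_MONOTONIC, &end);
 *   format_time_hdr(hdrbuf, sizeof(hdrbuf), start, end);
 */
```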

   - for the change on the RESPSIZE from 16kB to 128kB, I'm observing
     different results:
       - on the AMD, it's worse by a few percent (~2%). My guess is
         that it causes more L3 cache thrashing and that, since this
         machine has limited memory bandwidth (~35 GB/s), the larger
         working set has a negative impact.

       - on the ARM, it's slightly better, by ~2%. This machine has
         130 GB/s of memory bandwidth, which can easily amortize the
         extra RAM accesses and benefit from the slightly reduced
         scheduling overhead.

       - on both machines, reducing the response size to 32kB and using
         tune.bufsize 65536 gives a huge boost (and only this combination).
         On the AMD, it jumps from 167 to 269 Gbps (+61%); on the ARM,
         from 397 to 605 Gbps (+52%). Note, this was on H1, which for now
         remains the only one we can reliably monitor. Even SSL benefits
         from this, though less, due to the crypto overhead.

Cool observation.
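For reference, on the haproxy side that winning combination is just the buffer
tuning below; the 32kB response itself is then requested from haterm through
its existing URL syntax (e.g. "GET /?s=32768"):

```
global
    tune.bufsize 65536
```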

   - the last patch creating the loop to try to better fill the target
     buffer should theoretically not change anything, yet it does. On
     the AMD it degrades the performance by an extra 2-3%, while on the
     ARM it brings roughly 3%.

That's strange; I didn't expect that either.
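For what it's worth, the batched fill from patches 4 and 5 reduces to a loop
like the one below; the names (batch_fill etc.) are illustrative, not the real
helper:

```c
#include <string.h>

/* Copy as much of the remaining payload as fits into <out> in one pass,
 * reusing a prebuilt <src> block of <src_len> bytes, instead of emitting
 * one fixed-size block per call. Returns the number of bytes filled and
 * decrements <remaining> accordingly. */
static size_t batch_fill(char *out, size_t out_room,
                         const char *src, size_t src_len,
                         long long *remaining)
{
	size_t filled = 0;

	while (filled < out_room && *remaining > 0) {
		size_t chunk = src_len;

		if (chunk > out_room - filled)
			chunk = out_room - filled;      /* don't overflow <out>  */
		if ((long long)chunk > *remaining)
			chunk = (size_t)*remaining;     /* don't exceed the body */

		memcpy(out + filled, src, chunk);
		filled += chunk;
		*remaining -= chunk;
	}
	return filled;
}
```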

All this makes me think that we're facing a scheduling issue: there's
apparently one combination (respsize 32k + bufsize 64k) which gives the
best performance, most likely because it's the largest chunk that can
be copied at once in L1 and allows all copies to remain cache-line
aligned, but that's speculation. The data are then not too large to
send in one go while preserving TSO capabilities. Also, the fact that
the gain is so high on both architectures is not a coincidence: it's
not something due to the CPU architecture alone but to the software
architecture. I'm also wondering why there's a change when you loop
over the HTX, since in my opinion it ought to fill what it can at once;
this alone deserves investigation and might help address the first
point.

With these lessons learned, I will still use HATerm for my server benchmarks,
simply because it offers h1+h2+h3 as targets for the reverse proxies :-)

If you want, I can already merge your first patch (snprintf) as it's
definitely useful.

Yes, thanks.

Thank you!
Willy

Best regards
Aleks
