Hi Remi,
On 02. 10. 23 13:53, Remi Gacogne via dnsdist wrote:
Hi Ales,
On 25/09/2023 16:09, Aleš Rygl via dnsdist wrote:
I would like to kindly ask for help or advice. I have just
upgraded one of our dnsdist instances from 1.7.4 to 1.8.4 together
with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, no
issues observed apart from some deprecated config references. What is a
big surprise to me is CPU usage. The newer version has nearly two
times higher CPU consumption in userspace. I am nearly at 80% CPU
with 16 physical cores (was about 40%). We have a lot of TLS (DoT)
sessions (30k) and 60kqps in total (30k via DoT) here. The latency
measured by dnsdist also went up. We collect all the metrics dnsdist
produces via Graphite, so I can check any counters that might show
what is wrong.
Wow, that's awful. It's the first time I hear about such a regression,
and I really would like to understand what is going on.
1/ Are you using our packages, compiling yourself, or perhaps using
the Debian ones?
2/ Do you think it would be possible for you to try downgrading the
instance to 1.7.4 on Debian 12.1? It might help us pinpoint whether
the issue is related to a system change (I have seen people complain
about the performance of OpenSSL 3.0.x compared to 1.1.1x, for example).
3/ Would you mind sharing your configuration?
4/ And finally, do you think it would be possible for you to collect a
perf trace on this instance? It would require installing linux-perf,
if possible the debug symbols for dnsdist (dnsdist-dbgsym) then
running 'perf record --call-graph dwarf -p <pid of running dnsdist
process> -o </path/to/output/file>' for a few dozen seconds to
collect a trace, stopping it with Ctrl+C and finally getting a report
with "perf report -i </path/to/previous/file> --stdio". It should tell
us where the CPU usage is going.
Best regards,
Thanks for your response. After some deep documentation reading and
config tweaking I am nearly on the previous values regarding CPU load,
apart from latency, which is still higher (1.3ms -> 2.3ms). I suspect the
latency is now computed differently (I noticed a new set of latency
counters for TLS, TCP, etc.). The key configuration
parameter is setMaxTCPClientThreads(). Changing anything else (cache
shards, number of listeners, etc.) has nearly no impact. We had 256 with
1.7.4; now it is 16. Going higher means a rapid increase in CPU load, while
going below 16 means dropped TCP connections: in showTCPStats(), Queued
hits Max Queued. Insane values like 1024 kill the CPU. We have a physical
server with 16 physical cores; the OS sees 32 cores.
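
For reference, the setting described above can be sketched as a dnsdist
configuration fragment. This is only an illustration of the values
mentioned in this message, not the full configuration; it assumes the
call is placed in the configuration file read at startup, and the right
number will depend on the hardware and traffic mix.

```lua
-- dnsdist.conf fragment (sketch, values taken from this thread)

-- Number of worker threads handling incoming TCP and DoT connections.
-- 16 matches the number of physical cores on this server.
-- 256 was the value used with 1.7.4; with 1.8.4 it drove CPU load up sharply.
setMaxTCPClientThreads(16)

-- To check whether the value is too low, run this from the dnsdist console:
--   showTCPStats()
-- If "Queued" reaches "Max Queued", incoming TCP connections are being dropped.
```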
Back to your questions:
1/ from your repos
2/ yes, I could try it, but 1.7.4 for Bullseye crashes on Bookworm with
TLS enabled, and there are no packages of 1.7.4 for Bookworm in your
repo
3/ sure, I will do so
4/ no problem
Best regards
Ales
_______________________________________________
dnsdist mailing list
dnsdist@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/dnsdist