Hi Remi,

On 02. 10. 23 13:53, Remi Gacogne via dnsdist wrote:
Hi Ales,

On 25/09/2023 16:09, Aleš Rygl via dnsdist wrote:
    I would like to kindly ask for help or advice. I have just upgraded one of our dnsdist instances from 1.7.4 to 1.8.4, together with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, with no issues observed apart from some deprecated config references. What is a big surprise to me is the CPU usage. The newer version has nearly twice the CPU consumption in userspace: I am at nearly 80% CPU with 16 physical cores (it was about 40%). We have a lot of TLS (DoT) sessions (30k) and 60 kqps in total (30k via DoT) here. The latency measured by dnsdist went up as well. We are collecting all the metrics dnsdist produces via Graphite, so I can check the counters for what could be wrong.

Wow, that's awful. It's the first time I have heard about such a regression, and I would really like to understand what is going on.
1/ Are you using our packages, compiling yourself, or perhaps using the Debian ones?
2/ Do you think it would be possible for you to try downgrading the instance to 1.7.4 on Debian 12.1? It might help us pinpoint whether the issue is related to a system change (I have seen people complain about the performance of OpenSSL 3.0.x compared to 1.1.1x, for example).
3/ Would you mind sharing your configuration?
4/ And finally, do you think it would be possible for you to collect a perf trace on this instance? It would require installing linux-perf and, if possible, the debug symbols for dnsdist (dnsdist-dbgsym), then running 'perf record --call-graph dwarf -p <pid of running dnsdist process> -o </path/to/output/file>' for a few dozen seconds to collect a trace, stopping it with Ctrl+C, and finally getting a report with 'perf report -i </path/to/previous/file> --stdio'. It should tell us where the CPU usage is going.

Best regards,

    Thanks for your response. After some deep documentation reading and config tweaking, I am nearly back at the previous values regarding CPU load. The exception is latency, which is still higher (1.3 ms -> 2.3 ms). I suspect the latency is now computed in a different way (I noticed a new set of latency counters for TLS, TCP, etc.). The key configuration parameter is setMaxTCPClientThreads(). Changing anything else (cache shards, number of listeners, etc.) has nearly no impact. We had 256 with 1.7.4; now it is 16. Going higher means a rapid increase in CPU load, while going below 16 means dropped TCP connections, visible in showTCPStats(), where Queued hits Max Queued. Insane values like 1024 kill the CPU. We have a physical server with 16 physical cores; the OS sees 32 cores.
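
To illustrate, the relevant part of the configuration boils down to something like the sketch below. It is not our full config, only the one setting discussed above with the value that currently works on this hardware; the right number will likely differ on other machines and traffic mixes.

```lua
-- dnsdist.conf fragment (sketch only, not the complete production config)

-- Number of TCP/DoT client worker threads.
-- 256 was used with 1.7.4; 16 is the value that works with 1.8.4 here
-- (16 physical cores, 32 as seen by the OS).
setMaxTCPClientThreads(16)

-- To check for dropped connections, run showTCPStats() in the dnsdist
-- console and compare the Queued value against Max Queued.
```

If Queued keeps reaching Max Queued, the thread count is probably too low for the load; if CPU usage climbs sharply, it is probably too high.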

Back to your questions:

1/ from your repos
2/ yes, I could try it. The thing is that 1.7.4 for Bullseye crashes on Bookworm with TLS enabled, and there are no packages of 1.7.4 for Bookworm in your repo
3/ sure, I will do so
4/ no problem

Best regards

Ales





_______________________________________________
dnsdist mailing list
dnsdist@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/dnsdist
