On Thu, Feb 20, 2020 at 10:49:35PM +0000, Simon Kelley wrote:
> On 17/02/2020 14:37, Geert Stappers wrote:
> > On 17-02-2020 14:31, Donald Sharp wrote:
> >
> >> Running:
> >>
> >> sharpd@eva:~/dnsmasq$ /sbin/dnsmasq --version
> >> Dnsmasq version 2.80 Copyright (c) 2000-2018 Simon Kelley
> >
> > 2018, no short git hashes or similar indicators of the source version.
> >
> >
> >> Compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua
> >> TFTP conntrack ipset auth DNSSEC loop-detect inotify dumpfile
> >> ----
> >>
> >> When I install several hundred thousand routes into the kernel and
> >> remove them (or some variation thereof), dnsmasq eventually ends up
> >> running at 100% CPU:
> >>
> >> top - 18:45:18 up 1 day, 7:44, 1 user, load average: 2.70, 2.65, 2.34
> >> Tasks: 424 total, 3 running, 421 sleeping, 0 stopped, 0 zombie
> >> %Cpu(s): 12.1 us, 6.9 sy, 0.0 ni, 80.2 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
> >> MiB Mem : 32131.3 total, 19483.6 free, 6620.3 used, 6027.4 buff/cache
> >> MiB Swap: 32718.0 total, 31693.0 free, 1025.0 used. 24698.2 avail Mem
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 293183 nobody 20 0 11040 2040 1688 R 99.7 0.0 148:48.40 dnsmasq
> >
> >
> > The "CPU 100%" made me do `git log` and a "find" on 'CPU'. I found
> >
> >
> > commit df6636bff61aa53ed7ad4b34d940805193c0bc74
> > Author: Florent Fourcot <florent.four...@wifirst.fr>
> > Date: Mon Feb 11 17:04:44 2019 +0100
> >
> > lease: prune lease as soon as expired
> >
> > We detected a performance issue on a dnsmasq running many dhcp sessions
> > (more than 10 000). At the end of the day, the server was only releasing
> > old DHCP leases but was consuming a lot of CPU.
> >
> > It looks like current dhcp pruning:
> > 1) it's pruning old sessions (iterate on all current leases).
> > It's important to note that it's only pruning sessions that expired
> > more than one second ago
> > 2) it's looking for next lease to expire (iterate on all current leases
> > again)
> > 3) it launches an alarm to catch next expiration found in step 2). This
> > value can be zero for leases just expired (but not pruned).
> >
> > So, for a second, dnsmasq could fall in a "prune loop" by doing:
> > * Not pruning anything, since difftime() is not > 0
> > * Run alarm again with zero as argument
> >
> > On a server with very large number of leases and releasing often
> > sessions, that can waste a very big CPU time.
> >
> > Signed-off-by: Florent Fourcot <florent.four...@wifirst.fr>
> >
> >
> >>
> >> strace output:
> >>
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLERR}])
> >> ....
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=POLLERR}])
> >> poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 6, -1) = 1 ([{fd=4, revents=PO^Cstrace: Process 293183 detached
> >>
> >> I can pretty much make this happen at will. What can I provide to
> >> help debug this?
> >
> > Start with stating how recent the source is that you are using.
> >
> >
> >>
> >> As a side note, I was not placing these routes into the default Linux
> >> routing table. Does dnsmasq need to be paying attention to these routes?
> >
> > Side notes in a separate thread please.
> >
> >
> >>
> >> donald
> >>
> >
> > Regards
> >
> > Geert Stappers
> >
>
> Geert, you're confusing things.
Sorry for matching CPU load with CPU load.

> It's perfectly clear that the process is running at 100% CPU because
> the poll() calls are returning an error which the code is not expecting
> and doesn't handle. It just calls poll() again, and because the error
> wasn't cleared, poll returns immediately again, rinse and repeat.
>
> The solution is to handle the error (it's not obvious to me how to do
> that) or to avoid creating the error condition in the first place.
>
> To get further, we need to know which socket is erroring. It's file
> descriptor four in the strace, but is that the netlink socket, a DHCP
> socket, or a socket used to talk DNS upstream or downstream? We
> don't know without further information.


Geert Stappers
-- 
Silence is hard to parse
_______________________________________________
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
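Simon's two points, handling the error and finding out which socket it belongs to, can be combined in one small diagnostic hook for a poll() loop. The following is a minimal C sketch under stated assumptions, not dnsmasq code: the names `consume_socket_error`, `socket_family` and `family_name` are hypothetical, and a Linux/POSIX environment is assumed. It relies on behaviour documented in socket(7): `getsockopt(fd, SOL_SOCKET, SO_ERROR, ...)` returns and clears the pending socket error, so poll() should stop reporting POLLERR for that error, and `getsockname()` reports the descriptor's address family (AF_NETLINK, AF_INET, AF_INET6, ...).

```c
/*
 * Hypothetical diagnostic helpers for a poll()-based event loop.
 * Not taken from dnsmasq; a sketch assuming Linux/POSIX sockets.
 */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Map an address family to a short label for logging. */
static const char *family_name(int family)
{
    switch (family) {
    case AF_INET:    return "AF_INET";
    case AF_INET6:   return "AF_INET6";
    case AF_UNIX:    return "AF_UNIX";
#ifdef AF_NETLINK
    case AF_NETLINK: return "AF_NETLINK";
#endif
    default:         return "other";
    }
}

/* Return the address family of socket fd, or -1 if getsockname() fails. */
static int socket_family(int fd)
{
    struct sockaddr_storage addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return addr.ss_family;
}

/*
 * Call when poll() sets POLLERR in revents for fd.
 * SO_ERROR reads AND clears the pending error, so the next poll()
 * no longer returns immediately because of it.
 * Returns the cleared error (0 if none), or -1 if the query fails
 * (for example, fd is not a socket).
 */
static int consume_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;

    if (err != 0)
        fprintf(stderr, "fd %d (%s): pending error %d (%s), cleared\n",
                fd, family_name(socket_family(fd)), err, strerror(err));
    return err;
}
```

In the loop, one would call `consume_socket_error(fds[i].fd)` whenever `fds[i].revents & POLLERR` is set, before calling poll() again; the log line answers the "which socket is fd 4?" question. Clearing the error only stops the spin, not the cause. If fd 4 turned out to be the netlink socket (an assumption; the thread has not established this), netlink(7) documents that a full receive buffer drops messages and is reported as ENOBUFS, after which the application is expected to resynchronise its view of kernel state. A socket with IP_RECVERR enabled would also need its error queue drained with recvmsg(..., MSG_ERRQUEUE), which SO_ERROR does not do.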