Re: [Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
On Wed, Sep 19, 2018 at 1:44 PM Simon Kelley wrote: > This all makes me slightly uneasy. I think the "out of memory" > explanation for the crashes you are seeing is not a good one. No, I agree. I have compiled an OpenWRT image without the fix and installed it on my device, and I am trying to reproduce the issue. Will let you know when or if I figure out something. BR, Kristian ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Re: [Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
On 19/09/18 08:59, Kristian Evensen wrote: > Hi Simon, > > Thanks for a quick reply. > > On Wed, Sep 19, 2018 at 12:23 AM Simon Kelley wrote: >> Thanks for the report. The obvious explanation is that whine_malloc() is >> returning NULL, and the code should handle that. whine_malloc only >> returns NULL if the system cannot allocate any more memory, which is >> possible, but unlikely. Is your router very short on memory? > > No, the router has plenty of memory (2GB) and I don't see the "failed > to allocate"-message, so I guess whine_malloc() can't be the culprit. > Since I am using OpenWRT, there could be some defines affecting the > line numbers. I tried to read up on how ifdefs affects line numbers in > gdb backtraces to see if the error could be somewhere else than the > "default" line 1437, but I unfortunately couldn't find anything. > Probably my google-foo is a bit rusty. > > When looking over my notes, I see that I have made the following > observations related to this bug: > > * Crash happens quite rarely. > * I have only seen the bug right after boot. > * When the bug strikes, dnsmasq will enter a crash loop and never > recover. I.e., I can restart dnsmasq as many times as I like, crash > always happens. > * If I start dnsmasq manually and run it in the foreground after a > crash, I also see the error. > > So there seems to be something in the system causing this error, but I > can't figure out what. > >> I think the best solution is to wrap all of >> >> *crecp = *source; >> crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS | >> F_REVERSE); >> crecp->flags |= F_NAMEP; >> crecp->name.namep = name; >> >> cache_hash(crecp); >> >> with >> >> if (crecp) >> { >> } > > Thanks, this is basically the same as my current fix, so I can already > report that it is good :) > This all makes me slightly uneasy. I think the "out of memory" explanation for the crashes you are seeing is not a good one. The patch is definitely needed. The philosophy for memory allocation failures is that the code should be not terminate: it logs an error and continues, but not necessarily behaving correctly. That's about the best you can do. Unfortunately, with the patch, if crecp is NULL at this point for some other reason, it will now fail silently, and behave wrongly (the non-terminal cache entry will not be created) It would be really good to find out what actually causes this, and preferably a way to reproduce it. Cheers, Simon. ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Re: [Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
> On 19 Sep 2018, at 08:59, Kristian Evensen wrote: > > Hi Simon, > > Thanks for a quick reply. > > On Wed, Sep 19, 2018 at 12:23 AM Simon Kelley wrote: >> Thanks for the report. The obvious explanation is that whine_malloc() is >> returning NULL, and the code should handle that. whine_malloc only >> returns NULL if the system cannot allocate any more memory, which is >> possible, but unlikely. Is your router very short on memory? > > No, the router has plenty of memory (2GB) and I don't see the "failed > to allocate"-message, so I guess whine_malloc() can't be the culprit. > Since I am using OpenWRT, there could be some defines affecting the > line numbers. I tried to read up on how ifdefs affects line numbers in > gdb backtraces to see if the error could be somewhere else than the > "default" line 1437, but I unfortunately couldn't find anything. > Probably my google-foo is a bit rusty. > > When looking over my notes, I see that I have made the following > observations related to this bug: > > * Crash happens quite rarely. > * I have only seen the bug right after boot. > * When the bug strikes, dnsmasq will enter a crash loop and never > recover. I.e., I can restart dnsmasq as many times as I like, crash > always happens. > * If I start dnsmasq manually and run it in the foreground after a > crash, I also see the error. > > So there seems to be something in the system causing this error, but I > can't figure out what. > >> I think the best solution is to wrap all of >> >> *crecp = *source; >> crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS | >> F_REVERSE); >> crecp->flags |= F_NAMEP; >> crecp->name.namep = name; >> >> cache_hash(crecp); >> >> with >> >> if (crecp) >> { >> } > > Thanks, this is basically the same as my current fix, so I can already > report that it is good :) > > BR, > Kristian > And I backported the fix into openwrt master this morning. Cheers, Kevin D-B 012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A signature.asc Description: Message signed with OpenPGP ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Re: [Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
Hi Simon, Thanks for a quick reply. On Wed, Sep 19, 2018 at 12:23 AM Simon Kelley wrote: > Thanks for the report. The obvious explanation is that whine_malloc() is > returning NULL, and the code should handle that. whine_malloc only > returns NULL if the system cannot allocate any more memory, which is > possible, but unlikely. Is your router very short on memory? No, the router has plenty of memory (2GB) and I don't see the "failed to allocate"-message, so I guess whine_malloc() can't be the culprit. Since I am using OpenWRT, there could be some defines affecting the line numbers. I tried to read up on how ifdefs affects line numbers in gdb backtraces to see if the error could be somewhere else than the "default" line 1437, but I unfortunately couldn't find anything. Probably my google-foo is a bit rusty. When looking over my notes, I see that I have made the following observations related to this bug: * Crash happens quite rarely. * I have only seen the bug right after boot. * When the bug strikes, dnsmasq will enter a crash loop and never recover. I.e., I can restart dnsmasq as many times as I like, crash always happens. * If I start dnsmasq manually and run it in the foreground after a crash, I also see the error. So there seems to be something in the system causing this error, but I can't figure out what. > I think the best solution is to wrap all of > > *crecp = *source; > crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS | > F_REVERSE); > crecp->flags |= F_NAMEP; > crecp->name.namep = name; > > cache_hash(crecp); > > with > > if (crecp) > { > } Thanks, this is basically the same as my current fix, so I can already report that it is good :) BR, Kristian ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
Re: [Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
On 18/09/18 11:28, Kristian Evensen wrote: > Hello, > > I recently updated one of my x86-based OpenWRT-routers to the latest > nightly, which bumped dnsmasq to 2.80test6. After the update, I see > that dnsmasq sometimes enters a crash loop. The crash occurs right > startup (SIGSEV), and the backtrace, looks as follows: > > 0x00406a78 in make_non_terminals > (source=source@entry=0x77ffefa0) at cache.c:1437 > 1437 cache.c: No such file or directory. > (gdb) bt > #0 0x00406a78 in make_non_terminals > (source=source@entry=0x77ffefa0) at cache.c:1437 > #1 0x00407192 in add_hosts_entry (cache=0x77ffefa0, > addr=0x7fffea28, addrlen=4, index=2, rhash=, > hashsz=) at cache.c:911 > #2 0x004074e1 in read_hostsfile > (filename=filename@entry=0x425a8a "/etc/hosts", index=index@entry=2, > cache_size=cache_size@entry=150, rhash=0x635020, > hashsz=hashsz@entry=641) at cache.c:1040 > #3 0x00407810 in cache_reload () at cache.c:1185 > #4 0x00418037 in clear_cache_and_reload > (now=now@entry=1537265437) at dnsmasq.c:1547 > #5 0x00405ec3 in async_event (now=1537265437, pipe=15) at > dnsmasq.c:1310 > #6 main (argc=, argv=) at dnsmasq.c:1077 > > My current work-around is to check if crecp is NULL and then return > from make_non_terminals(). However, I don't know the code well enough > to know if this change will break anything else (though everything > seems to work fine). > > There is nothing special with my hosts file, it looks as follows: > > 127.0.0.1 localhost > > ::1 localhost ip6-localhost ip6-loopback > ff02::1 ip6-allnodes > ff02::2 ip6-allrouters > 192.168.5.1 myrouter.lan > > If checking for NULL is acceptable, I will be more than happy to send > my patch. If not, I wonder if anyone has any hints on how to proceed > to solve this issue? > Thanks for the report. The obvious explanation is that whine_malloc() is returning NULL, and the code should handle that. whine_malloc only returns NULL if the system cannot allocate any more memory, which is possible, but unlikely. Is your router very short on memory? whine_malloc() is just a wrapper around malloc that logs an error if the malloc fails (hence the name) Are you seeing messages like failed to allocate %d bytes just before the crashes? That would make me more comfortable that we understand the problem. I think the best solution is to wrap all of *crecp = *source; crecp->flags &= ~(F_IPV4 | F_IPV6 | F_CNAME | F_DNSKEY | F_DS | F_REVERSE); crecp->flags |= F_NAMEP; crecp->name.namep = name; cache_hash(crecp); with if (crecp) { } and I'll commit that to the git repo now. Cheers, Simon ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
[Dnsmasq-discuss] Seg. fault in cache.c after commt b6f926fb
Hello, I recently updated one of my x86-based OpenWRT-routers to the latest nightly, which bumped dnsmasq to 2.80test6. After the update, I see that dnsmasq sometimes enters a crash loop. The crash occurs right startup (SIGSEV), and the backtrace, looks as follows: 0x00406a78 in make_non_terminals (source=source@entry=0x77ffefa0) at cache.c:1437 1437 cache.c: No such file or directory. (gdb) bt #0 0x00406a78 in make_non_terminals (source=source@entry=0x77ffefa0) at cache.c:1437 #1 0x00407192 in add_hosts_entry (cache=0x77ffefa0, addr=0x7fffea28, addrlen=4, index=2, rhash=, hashsz=) at cache.c:911 #2 0x004074e1 in read_hostsfile (filename=filename@entry=0x425a8a "/etc/hosts", index=index@entry=2, cache_size=cache_size@entry=150, rhash=0x635020, hashsz=hashsz@entry=641) at cache.c:1040 #3 0x00407810 in cache_reload () at cache.c:1185 #4 0x00418037 in clear_cache_and_reload (now=now@entry=1537265437) at dnsmasq.c:1547 #5 0x00405ec3 in async_event (now=1537265437, pipe=15) at dnsmasq.c:1310 #6 main (argc=, argv=) at dnsmasq.c:1077 My current work-around is to check if crecp is NULL and then return from make_non_terminals(). However, I don't know the code well enough to know if this change will break anything else (though everything seems to work fine). There is nothing special with my hosts file, it looks as follows: 127.0.0.1 localhost ::1 localhost ip6-localhost ip6-loopback ff02::1 ip6-allnodes ff02::2 ip6-allrouters 192.168.5.1 myrouter.lan If checking for NULL is acceptable, I will be more than happy to send my patch. If not, I wonder if anyone has any hints on how to proceed to solve this issue? BR, Kristian ___ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss