> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected]> wrote:
> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>> 
>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[email protected]> wrote:
>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>> 
>>>>> 
>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded 
>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any 
>>>>> sysctl values and nsd.conf matches the default as well, just 
>>>>> added a few hundred zones.
>>>>> 
>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf: 
>>>>>   server-count: 2
>>>>> it starts spamming my log with:
>>>>>   nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>>> 
>>>>> checking the source, server.c seems not to handle EAGAIN 
>>>>> after sendto() and does not recover or retry, it just increases
>>>>> txerr statistic count - so answer seems really lost :(
>>>>> 
>>>>> I tried higher debug level, as well as increasing socket buffers to: 
>>>>>   net.inet.udp.recvspace= 65536
>>>>>   net.inet.udp.sendspace=65636
>>>>> but both didn’t help and netstat -s -p udp does show 
>>>>>   0 dropped due to full socket buffers  
>>>>> anyways. So, I don’t believe this is a socket buffer issue.
>>>>> 
>>>>> The same server-count: 2 setting worked fine with 6.3.
>>>>> 
>>>>> Any hints, insights, or pointers?
>>>>> Does anyone else experience the same?
>>>>> 
>>>> 
>>>> Maybe it's worth trying to track down further whether this is due to an
>>>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
>>>> to use "make -f Makefile.bsd-wrapper [..]" when building).
>>>> 
>>> 
>>> Or, following a comment from claudio@, try a kernel built with this:
>> 
>> FYI, I tried that diff and a few other things but neither did help. 

FYI, after upgrade to 6.7 the issue is still the same, 
just with different format for the syslog message not 
explicitly mentioning sendto() anymore, the new error
log repeating gazillion of times looks like this:
    nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily 
unavailable

> Did you ktrace(1) the problem?

Yes, please see below.

> How is sendto(2) called, in particular
> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  

>From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
Neither, sendto() nor recvfrom() seem to have any flags set.

> Does
> that mean the kernel returns EWOULDBLOCK even if the userland said it is
> fine to block?

I don’t think so. 


From my limited understanding it seems the two processes 
seem to concurrently try to recvfrom() after BOTH received kevent() 
notification with only one being the “winner” and the other one 
logging the failing?

Shouldn’t only one of the two try to call recvfrom()?

Not sure if this is a bug in OpenBSD or NSD, but
it worked fine in earlier releases.

However, this one seems related:
https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385 
<https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>



 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, 
flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ, 
flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
 52438 nsd      RET   kevent 1
 55919 nsd      RET   kevent 1
 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      STRU  struct timespec { 1224.307942259 }
 55919 nsd      STRU  struct timespec { 1224.307946954 }
 52438 nsd      RET   clock_gettime 0
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  
recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
 55919 nsd      CALL  
recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
 52438 nsd      GIO   fd 5 read 46 bytes
       
"\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily unavailable
 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      RET   recvfrom 46/0x2e
 55919 nsd      STRU  struct timespec { 1224.307967366 }
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22 
2020">.287233 }
 55919 nsd      STRU  struct timespec { 1224.307977172 }
 52438 nsd      RET   gettimeofday 0
 55919 nsd      RET   clock_gettime 0
 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
 55919 nsd      STRU  struct timespec { 109.487449000 }
 52438 nsd      GIO   fd 5 wrote 46 bytes
       
"\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
 52438 nsd      RET   sendto 46/0x2e




>>> Index: syscalls.master
>>> ===================================================================
>>> RCS file: /cvs/src/sys/kern/syscalls.master,v
>>> retrieving revision 1.189
>>> diff -u -p -r1.189 syscalls.master
>>> --- syscalls.master 11 Jan 2019 18:46:30 -0000      1.189
>>> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
>>> @@ -261,7 +261,7 @@
>>> 130 OBSOL           oftruncate
>>> 131 STD             { int sys_flock(int fd, int how); }
>>> 132 STD             { int sys_mkfifo(const char *path, mode_t mode); }
>>> -133        STD NOLOCK      { ssize_t sys_sendto(int s, const void *buf, \
>>> +133        STD             { ssize_t sys_sendto(int s, const void *buf, \
>>>                         size_t len, int flags, const struct sockaddr *to, \
>>>                         socklen_t tolen); }
>>> 134 STD             { int sys_shutdown(int s, int how); }
>>> 
>>> 
>>> Run "make syscalls" in sys/kern before building.

Reply via email to