> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected]> wrote:
> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>
>>> On 26. Sep 2019, at 15:02, Stuart Henderson <[email protected]> wrote:
>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>
>>>>>
>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>> added a few hundred zones.
>>>>>
>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>> server-count: 2
>>>>> it starts spamming my log with:
>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
>>>>>
>>>>> checking the source, server.c seems not to handle EAGAIN
>>>>> after sendto() and does not recover or retry, it just increases
>>>>> txerr statistic count - so answer seems really lost :(
>>>>>
>>>>> I tried higher debug level, as well as increasing socket buffers to:
>>>>> net.inet.udp.recvspace= 65536
>>>>> net.inet.udp.sendspace=65636
>>>>> but both didn’t help and netstat -s -p udp does show
>>>>> 0 dropped due to full socket buffers
>>>>> anyways. So, I don’t believe this is a socket buffer issue.
>>>>>
>>>>> The same server-count: 2 setting worked fine with 6.3.
>>>>>
>>>>> Any hints, insights, or pointers?
>>>>> Does anyone else experience the same?
>>>>>
>>>>
>>>> Maybe it's worth trying to track down further whether this is due to an
>>>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
>>>> to use "make -f Makefile.bsd-wrapper [..]" when building).
>>>>
>>>
>>> Or, following a comment from claudio@, try a kernel built with this:
>>
>> FYI, I tried that diff and a few other things but neither did help.
FYI, after upgrade to 6.7 the issue is still the same,
just with different format for the syslog message not
explicitly mentioning sendto() anymore, the new error
log repeating gazillion of times looks like this:
nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily
unavailable
> Did you ktrace(1) the problem?
Yes, please see below.
> How is sendto(2) called, in particular
> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?
>From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
Neither, sendto() nor recvfrom() seem to have any flags set.
> Does
> that mean the kernel returns EWOULDBLOCK even if the userland said it is
> fine to block?
I don’t think so.
From my limited understanding it seems the two processes
seem to concurrently try to recvfrom() after BOTH received kevent()
notification with only one being the “winner” and the other one
logging the failing?
Shouldn’t only one of the two try to call recvfrom()?
Not sure if this is a bug in OpenBSD or NSD, but
it worked fine in earlier releases.
However, this one seems related:
https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385
<https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>
52438 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
55919 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
flags=0x1<EV_ADD>, fflags=0<>, data=62
, udata=0x30562fe4198 }
52438 nsd RET kevent 1
55919 nsd RET kevent 1
52438 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
52438 nsd STRU struct timespec { 1224.307942259 }
55919 nsd STRU struct timespec { 1224.307946954 }
52438 nsd RET clock_gettime 0
55919 nsd RET clock_gettime 0
52438 nsd CALL
recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
55919 nsd CALL
recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
52438 nsd GIO fd 5 read 46 bytes
"\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
55919 nsd RET recvfrom -1 errno 35 Resource temporarily unavailable
52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
52438 nsd RET recvfrom 46/0x2e
55919 nsd STRU struct timespec { 1224.307967366 }
55919 nsd RET clock_gettime 0
52438 nsd CALL gettimeofday(0x7f7ffffbeed0,0)
55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
52438 nsd STRU struct timeval { 1592402542<"Jun 17 10:02:22
2020">.287233 }
55919 nsd STRU struct timespec { 1224.307977172 }
52438 nsd RET gettimeofday 0
55919 nsd RET clock_gettime 0
52438 nsd CALL sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
55919 nsd CALL kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
55919 nsd STRU struct timespec { 109.487449000 }
52438 nsd GIO fd 5 wrote 46 bytes
"\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
52438 nsd RET sendto 46/0x2e
>>> Index: syscalls.master
>>> ===================================================================
>>> RCS file: /cvs/src/sys/kern/syscalls.master,v
>>> retrieving revision 1.189
>>> diff -u -p -r1.189 syscalls.master
>>> --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
>>> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
>>> @@ -261,7 +261,7 @@
>>> 130 OBSOL oftruncate
>>> 131 STD { int sys_flock(int fd, int how); }
>>> 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
>>> -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
>>> +133 STD { ssize_t sys_sendto(int s, const void *buf, \
>>> size_t len, int flags, const struct sockaddr *to, \
>>> socklen_t tolen); }
>>> 134 STD { int sys_shutdown(int s, int how); }
>>>
>>>
>>> Run "make syscalls" in sys/kern before building.