On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>
> > On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected]> wrote:
> > On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>
> >>> On 26. Sep 2019, at 15:02, Stuart Henderson <[email protected]> wrote:
> >>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>
> >>>>>
> >>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>>>> sysctl values and nsd.conf matches the default as well, just
> >>>>> added a few hundred zones.
> >>>>>
> >>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>>>> server-count: 2
> >>>>> it starts spamming my log with:
> >>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
> >>>>> unavailable
> >>>>>
> >>>>> checking the source, server.c seems not to handle EAGAIN
> >>>>> after sendto() and does not recover or retry, it just increases
> >>>>> txerr statistic count - so answer seems really lost :(
> >>>>>
> >>>>> I tried higher debug level, as well as increasing socket buffers to:
> >>>>> net.inet.udp.recvspace= 65536
> >>>>> net.inet.udp.sendspace=65636
> >>>>> but both didn’t help and netstat -s -p udp does show
> >>>>> 0 dropped due to full socket buffers
> >>>>> anyways. So, I don’t believe this is a socket buffer issue.
> >>>>>
> >>>>> The same server-count: 2 setting worked fine with 6.3.
> >>>>>
> >>>>> Any hints, insights, or pointers?
> >>>>> Does anyone else experience the same?
> >>>>>
> >>>>
> >>>> Maybe it's worth trying to track down further whether this is due to an
> >>>> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be
> >>>> sure
> >>>> to use "make -f Makefile.bsd-wrapper [..]" when building).
> >>>>
> >>>
> >>> Or, following a comment from claudio@, try a kernel built with this:
> >>
> >> FYI, I tried that diff and a few other things but neither did help.
>
> FYI, after upgrade to 6.7 the issue is still the same,
> just with different format for the syslog message not
> explicitly mentioning sendto() anymore, the new error
> log repeating gazillion of times looks like this:
> nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource temporarily
> unavailable
>
> > Did you ktrace(1) the problem?
>
> Yes, please see below.
>
> > How is sendto(2) called, in particular
> > is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?
>
> From the ktrace dump it seems not sendto() being the culprit, but recvfrom().
> Neither, sendto() nor recvfrom() seem to have any flags set.
I think this is an error in your analysis. It's perfectly normal for
recvform to return EAGAIN.
On the sending side it indicates an issue.
-Otto
>
> > Does
> > that mean the kernel returns EWOULDBLOCK even if the userland said it is
> > fine to block?
>
> I don’t think so.
>
>
> From my limited understanding it seems the two processes
> seem to concurrently try to recvfrom() after BOTH received kevent()
> notification with only one being the “winner” and the other one
> logging the failing?
>
> Shouldn’t only one of the two try to call recvfrom()?
>
> Not sure if this is a bug in OpenBSD or NSD, but
> it worked fine in earlier releases.
>
> However, this one seems related:
> https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385
> <https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=385>
>
>
>
> 52438 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x1<EV_ADD>, fflags=0<>, data=62
> , udata=0x30562fe4198 }
> 55919 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
> flags=0x1<EV_ADD>, fflags=0<>, data=62
> , udata=0x30562fe4198 }
> 52438 nsd RET kevent 1
> 55919 nsd RET kevent 1
> 52438 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> 52438 nsd STRU struct timespec { 1224.307942259 }
> 55919 nsd STRU struct timespec { 1224.307946954 }
> 52438 nsd RET clock_gettime 0
> 55919 nsd RET clock_gettime 0
> 52438 nsd CALL
> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> 55919 nsd CALL
> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> 52438 nsd GIO fd 5 read 46 bytes
>
> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> 55919 nsd RET recvfrom -1 errno 35 Resource temporarily unavailable
> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> 52438 nsd RET recvfrom 46/0x2e
> 55919 nsd STRU struct timespec { 1224.307967366 }
> 55919 nsd RET clock_gettime 0
> 52438 nsd CALL gettimeofday(0x7f7ffffbeed0,0)
> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> 52438 nsd STRU struct timeval { 1592402542<"Jun 17 10:02:22
> 2020">.287233 }
> 55919 nsd STRU struct timespec { 1224.307977172 }
> 52438 nsd RET gettimeofday 0
> 55919 nsd RET clock_gettime 0
> 52438 nsd CALL sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
> 55919 nsd CALL kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
> 55919 nsd STRU struct timespec { 109.487449000 }
> 52438 nsd GIO fd 5 wrote 46 bytes
>
> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> 52438 nsd RET sendto 46/0x2e
>
>
>
>
> >>> Index: syscalls.master
> >>> ===================================================================
> >>> RCS file: /cvs/src/sys/kern/syscalls.master,v
> >>> retrieving revision 1.189
> >>> diff -u -p -r1.189 syscalls.master
> >>> --- syscalls.master 11 Jan 2019 18:46:30 -0000 1.189
> >>> +++ syscalls.master 26 Sep 2019 13:01:46 -0000
> >>> @@ -261,7 +261,7 @@
> >>> 130 OBSOL oftruncate
> >>> 131 STD { int sys_flock(int fd, int how); }
> >>> 132 STD { int sys_mkfifo(const char *path, mode_t
> >>> mode); }
> >>> -133 STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \
> >>> +133 STD { ssize_t sys_sendto(int s, const void *buf, \
> >>> size_t len, int flags, const struct sockaddr *to, \
> >>> socklen_t tolen); }
> >>> 134 STD { int sys_shutdown(int s, int how); }
> >>>
> >>>
> >>> Run "make syscalls" in sys/kern before building.
>