On Sun, Jun 21, 2020 at 10:55:55PM +0200, Joerg Jung wrote:
>
>
> > On 19. Jun 2020, at 15:03, Wouter Wijngaards <[email protected]> wrote:
> > On 19/06/2020 13:50, Joerg Jung wrote:
> >>
> >>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[email protected]
> >>> <mailto:[email protected]>
> >>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>> On 18/06/2020 11:49, Otto Moerbeek wrote:
> >>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
> >>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[email protected]
> >>>>>> <mailto:[email protected]>
> >>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
> >>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected]
> >>>>>>>> <mailto:[email protected]>
> >>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>>>>>>>>
> >>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
> >>>>>>>>>> <[email protected] <mailto:[email protected]>
> >>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
> >>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
> >>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just
> >>>>>>>>>>>> added a few hundred zones.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
> >>>>>>>>>>>> server-count: 2
> >>>>>>>>>>>> it starts spamming my log with:
> >>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
> >>>>>>>>>>>> unavailable
> >>>>>>>>>>>>
> >>>>>>>
> >>>>>>> FYI, after upgrade to 6.7 the issue is still the same,
> >>>>>>> just with different format for the syslog message not
> >>>>>>> explicitly mentioning sendto() anymore, the new error
> >>>>>>> log repeating gazillion of times looks like this:
> >>>>>>> nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
> >>>>>>> temporarily unavailable
> >>>>>>>
> >>>>>>>> Did you ktrace(1) the problem?
> >>>>>>>
> >>>>>>> Yes, please see below.
> >>>>>>>
> >>>>>>>> How is sendto(2) called, in particular
> >>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?
> >>>>>>>
> >>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
> >>>>>>> recvfrom().
> >>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
> >>>>>>
> >>>>>> I think this is an error in your analysis. It's perfectly normal for
> >>>>>> recvform to return EAGAIN.
> >>>>>
> >>>>> Yes, the log message coming from sendto() / sendmmsg() was
> >>>>> misleading here.
> >>>>>
> >>>>>> On the sending side it indicates an issue.
> >>>>>
> >>>>> Yes, as verified with ktrace the sendto() works fine.
> >>>>>
> >>>>> However, it’s not normal to write 200 log messages per second for
> >>>>> recvfrom() EAGAIN.
> >>>>>
> >>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
> >>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
> >>>>> — if verbosity is 1, which is OpenBSD default.
> >>>>>
> >>>>> Besides, I believe (reading man pages) these network errors
> >>>>> should be only logged at a higher verbosity 2 anyways.
> >>>
> >>> Commit
> >>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
> >>> should remove the EAGAIN log lines so they don't bother you.
> >>
> >> Thanks for looking into it.
> >>
> >>> At first I thought you were talking about the return value of the
> >>> syscall, but instead it is one of the elements of the send list.
> >>
> >> There seems only 1 element in the send list (usually),
> >> which is sent fully and successfully with no error from sendto()
> >> and no EAGAIN being generated.
> >>
> >> So, I wonder where the EAGAIN is actually coming from:
> >> Looking at the ktrace output below and the code it seems most
> >> likely to be propagated from recvmmsg() to sendmmsg(), no?
> >>
> >> Wild guess: Is there an off-by-one in the while loop, which results
> >> in sendmmsg() called again, which restores the previous errno
> >> from recvmmsg() but does not hit sendto() again.
> >>
> >> So, your recent commit may just “hide" the actual “issue".
> >
> > You may be right, does this commit look better:
> > https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104
> >
> > <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
> >
> > It retries when UDP send buffers are full and sets the blocking flag so
> > that the operation completes.
>
> Hah. that’s a large hammer, which may work and seems the right thing
> to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
> buffers.
>
> But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(),
> since sendto() just finishes fine (see last line in ktrace below) -- so no
> buffer
> issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and
> none from sendto() (or any other syscall). As I wrote above, the EAGAIN
> seems just “propagated” from recvmmsg() to sendmmsg() due to the errno
> restore dance.
>
> Why does the code postpone the error handling (and error logging) at all and
> stores / restore errno for later iterations - why not instead handle the
> error (logging)
> immediately?
>
>
> >>>>>>> 52438 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 55919 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 52438 nsd RET kevent 1
> >>>>>>> 55919 nsd RET kevent 1
> >>>>>>> 52438 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd STRU struct timespec { 1224.307942259 }
> >>>>>>> 55919 nsd STRU struct timespec { 1224.307946954 }
> >>>>>>> 52438 nsd RET clock_gettime 0
> >>>>>>> 55919 nsd RET clock_gettime 0
> >>>>>>> 52438 nsd CALL
> >>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 55919 nsd CALL
> >>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 52438 nsd GIO fd 5 read 46 bytes
> >>>>>>>
> >>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 55919 nsd RET recvfrom -1 errno 35 Resource temporarily
> >>>>>>> unavailable
> >>>>>>> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd RET recvfrom 46/0x2e
> >>>>>>> 55919 nsd STRU struct timespec { 1224.307967366 }
> >>>>>>> 55919 nsd RET clock_gettime 0
> >>>>>>> 52438 nsd CALL gettimeofday(0x7f7ffffbeed0,0)
> >>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd STRU struct timeval { 1592402542<"Jun 17 10:02:22
> >>>>>>> 2020">.287233 }
> >>>>>>> 55919 nsd STRU struct timespec { 1224.307977172 }
> >>>>>>> 52438 nsd RET gettimeofday 0
> >>>>>>> 55919 nsd RET clock_gettime 0
> >>>>>>> 52438 nsd CALL sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
> >>>>>>> 55919 nsd CALL kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
> >>>>>>> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd STRU struct timespec { 109.487449000 }
> >>>>>>> 52438 nsd GIO fd 5 wrote 46 bytes
> >>>>>>>
> >>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 52438 nsd RET sendto 46/0x2e
>
I think we have different issues. I originally was thinking this was
about the occasional sendo EAGAIN that some see while the send buffer
does not appear to be full. That is different thing than recfrom
failing.
-Otto