On Sun, Jun 21, 2020 at 10:55:55PM +0200, Joerg Jung wrote:

> 
> 
> > On 19. Jun 2020, at 15:03, Wouter Wijngaards <[email protected]> wrote:
> > On 19/06/2020 13:50, Joerg Jung wrote:
> >> 
> >>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[email protected] 
> >>> <mailto:[email protected]>
> >>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>> On 18/06/2020 11:49, Otto Moerbeek wrote:
> >>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
> >>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[email protected] 
> >>>>>> <mailto:[email protected]>
> >>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
> >>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected] 
> >>>>>>>> <mailto:[email protected]>
> >>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> >>>>>>>>> 
> >>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
> >>>>>>>>>> <[email protected] <mailto:[email protected]> 
> >>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
> >>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
> >>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded 
> >>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any 
> >>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just 
> >>>>>>>>>>>> added a few hundred zones.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf: 
> >>>>>>>>>>>> server-count: 2
> >>>>>>>>>>>> it starts spamming my log with:
> >>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
> >>>>>>>>>>>> unavailable
> >>>>>>>>>>>> 
> >>>>>>> 
> >>>>>>> FYI, after upgrade to 6.7 the issue is still the same, 
> >>>>>>> just with different format for the syslog message not 
> >>>>>>> explicitly mentioning sendto() anymore, the new error
> >>>>>>> log repeating gazillion of times looks like this:
> >>>>>>>   nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
> >>>>>>> temporarily unavailable
> >>>>>>> 
> >>>>>>>> Did you ktrace(1) the problem?
> >>>>>>> 
> >>>>>>> Yes, please see below.
> >>>>>>> 
> >>>>>>>> How is sendto(2) called, in particular
> >>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
> >>>>>>> 
> >>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
> >>>>>>> recvfrom().
> >>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
> >>>>>> 
> >>>>>> I think this is an error in your analysis. It's perfectly normal for
> >>>>>> recvform to return EAGAIN.
> >>>>> 
> >>>>> Yes, the log message coming from sendto() / sendmmsg() was 
> >>>>> misleading here.
> >>>>> 
> >>>>>> On the sending side it indicates an issue. 
> >>>>> 
> >>>>> Yes, as verified with ktrace the sendto() works fine.
> >>>>> 
> >>>>> However, it’s not normal to write 200 log messages per second for
> >>>>> recvfrom() EAGAIN.
> >>>>> 
> >>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg() 
> >>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
> >>>>> — if verbosity is 1, which is OpenBSD default.
> >>>>> 
> >>>>> Besides, I believe (reading man pages) these network errors 
> >>>>> should be only logged at a higher verbosity 2 anyways.
> >>> 
> >>> Commit
> >>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
> >>> should remove the EAGAIN log lines so they don't bother you.
> >> 
> >> Thanks for looking into it.
> >> 
> >>> At first I thought you were talking about the return value of the
> >>> syscall, but instead it is one of the elements of the send list.
> >> 
> >> There seems only 1 element in the send list (usually), 
> >> which is sent fully and successfully with no error from sendto()
> >> and no EAGAIN being generated. 
> >> 
> >> So, I wonder where the EAGAIN is actually coming from:
> >> Looking at the ktrace output below and the code it seems most
> >> likely to be propagated from recvmmsg() to sendmmsg(), no? 
> >> 
> >> Wild guess: Is there an off-by-one in the while loop, which results
> >> in sendmmsg() called again, which restores the previous errno 
> >> from recvmmsg() but does not hit sendto() again.
> >> 
> >> So, your recent commit may just “hide" the actual “issue".
> > 
> > You may be right, does this commit look better:
> > https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104
> >  
> > <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
> > 
> > It retries when UDP send buffers are full and sets the blocking flag so
> > that the operation completes.
> 
> Hah. that’s a large hammer, which may work and seems the right thing 
> to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
> buffers. 
> 
> But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(), 
> since sendto() just finishes fine (see last line in ktrace below) -- so no 
> buffer 
> issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and 
> none from sendto() (or any other syscall). As I wrote above, the EAGAIN 
> seems just “propagated” from recvmmsg() to sendmmsg() due to the errno 
> restore dance.
> 
> Why does the code postpone the error handling (and error logging) at all and
> stores / restore errno for later iterations - why not instead handle the 
> error (logging) 
> immediately?
> 
> 
> >>>>>>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
> >>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
> >>>>>>> , udata=0x30562fe4198 }
> >>>>>>> 52438 nsd      RET   kevent 1
> >>>>>>> 55919 nsd      RET   kevent 1
> >>>>>>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
> >>>>>>> 52438 nsd      RET   clock_gettime 0
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL
> >>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 55919 nsd      CALL
> >>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
> >>>>>>> 52438 nsd      GIO   fd 5 read 46 bytes
> >>>>>>>     
> >>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily
> >>>>>>> unavailable
> >>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      RET   recvfrom 46/0x2e
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
> >>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
> >>>>>>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22
> >>>>>>> 2020">.287233 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
> >>>>>>> 52438 nsd      RET   gettimeofday 0
> >>>>>>> 55919 nsd      RET   clock_gettime 0
> >>>>>>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
> >>>>>>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
> >>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
> >>>>>>> 55919 nsd      STRU  struct timespec { 109.487449000 }
> >>>>>>> 52438 nsd      GIO   fd 5 wrote 46 bytes
> >>>>>>>     
> >>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
> >>>>>>> 52438 nsd      RET   sendto 46/0x2e
> 

I think we have different issues. I originally was thinking this was
about the occasional sendo EAGAIN that some see while the send buffer
does not appear to be full. That is different thing than recfrom
failing.

        -Otto

Reply via email to