> On 19. Jun 2020, at 15:03, Wouter Wijngaards <[email protected]> wrote:
> On 19/06/2020 13:50, Joerg Jung wrote:
>> 
>>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[email protected] 
>>> <mailto:[email protected]>
>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>> On 18/06/2020 11:49, Otto Moerbeek wrote:
>>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
>>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[email protected] 
>>>>>> <mailto:[email protected]>
>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected] 
>>>>>>>> <mailto:[email protected]>
>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>>>>>>>> 
>>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
>>>>>>>>>> <[email protected] <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded 
>>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any 
>>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just 
>>>>>>>>>>>> added a few hundred zones.
>>>>>>>>>>>> 
>>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf: 
>>>>>>>>>>>> server-count: 2
>>>>>>>>>>>> it starts spamming my log with:
>>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
>>>>>>>>>>>> unavailable
>>>>>>>>>>>> 
>>>>>>> 
>>>>>>> FYI, after upgrade to 6.7 the issue is still the same, 
>>>>>>> just with different format for the syslog message not 
>>>>>>> explicitly mentioning sendto() anymore, the new error
>>>>>>> log repeating gazillion of times looks like this:
>>>>>>>   nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
>>>>>>> temporarily unavailable
>>>>>>> 
>>>>>>>> Did you ktrace(1) the problem?
>>>>>>> 
>>>>>>> Yes, please see below.
>>>>>>> 
>>>>>>>> How is sendto(2) called, in particular
>>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?  
>>>>>>> 
>>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
>>>>>>> recvfrom().
>>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
>>>>>> 
>>>>>> I think this is an error in your analysis. It's perfectly normal for
>>>>>> recvform to return EAGAIN.
>>>>> 
>>>>> Yes, the log message coming from sendto() / sendmmsg() was 
>>>>> misleading here.
>>>>> 
>>>>>> On the sending side it indicates an issue. 
>>>>> 
>>>>> Yes, as verified with ktrace the sendto() works fine.
>>>>> 
>>>>> However, it’s not normal to write 200 log messages per second for
>>>>> recvfrom() EAGAIN.
>>>>> 
>>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg() 
>>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
>>>>> — if verbosity is 1, which is OpenBSD default.
>>>>> 
>>>>> Besides, I believe (reading man pages) these network errors 
>>>>> should be only logged at a higher verbosity 2 anyways.
>>> 
>>> Commit
>>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
>>> should remove the EAGAIN log lines so they don't bother you.
>> 
>> Thanks for looking into it.
>> 
>>> At first I thought you were talking about the return value of the
>>> syscall, but instead it is one of the elements of the send list.
>> 
>> There seems only 1 element in the send list (usually), 
>> which is sent fully and successfully with no error from sendto()
>> and no EAGAIN being generated. 
>> 
>> So, I wonder where the EAGAIN is actually coming from:
>> Looking at the ktrace output below and the code it seems most
>> likely to be propagated from recvmmsg() to sendmmsg(), no? 
>> 
>> Wild guess: Is there an off-by-one in the while loop, which results
>> in sendmmsg() called again, which restores the previous errno 
>> from recvmmsg() but does not hit sendto() again.
>> 
>> So, your recent commit may just “hide" the actual “issue".
> 
> You may be right, does this commit look better:
> https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104
>  
> <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
> 
> It retries when UDP send buffers are full and sets the blocking flag so
> that the operation completes.

Hah. that’s a large hammer, which may work and seems the right thing 
to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
buffers. 

But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(), 
since sendto() just finishes fine (see last line in ktrace below) -- so no 
buffer 
issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and 
none from sendto() (or any other syscall). As I wrote above, the EAGAIN 
seems just “propagated” from recvmmsg() to sendmmsg() due to the errno 
restore dance.

Why does the code postpone the error handling (and error logging) at all and
stores / restore errno for later iterations - why not instead handle the error 
(logging) 
immediately?


>>>>>>> 52438 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 55919 nsd      STRU  struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 52438 nsd      RET   kevent 1
>>>>>>> 55919 nsd      RET   kevent 1
>>>>>>> 52438 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      STRU  struct timespec { 1224.307942259 }
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307946954 }
>>>>>>> 52438 nsd      RET   clock_gettime 0
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL
>>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 55919 nsd      CALL
>>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 52438 nsd      GIO   fd 5 read 46 bytes
>>>>>>>     
>>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 55919 nsd      RET   recvfrom -1 errno 35 Resource temporarily
>>>>>>> unavailable
>>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      RET   recvfrom 46/0x2e
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307967366 }
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL  gettimeofday(0x7f7ffffbeed0,0)
>>>>>>> 55919 nsd      CALL  clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd      STRU  struct timeval { 1592402542<"Jun 17 10:02:22
>>>>>>> 2020">.287233 }
>>>>>>> 55919 nsd      STRU  struct timespec { 1224.307977172 }
>>>>>>> 52438 nsd      RET   gettimeofday 0
>>>>>>> 55919 nsd      RET   clock_gettime 0
>>>>>>> 52438 nsd      CALL  sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>>>>>>> 55919 nsd      CALL  kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>>>>>>> 52438 nsd      STRU  struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd      STRU  struct timespec { 109.487449000 }
>>>>>>> 52438 nsd      GIO   fd 5 wrote 46 bytes
>>>>>>>     
>>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 52438 nsd      RET   sendto 46/0x2e

Reply via email to