> On 19. Jun 2020, at 15:03, Wouter Wijngaards <[email protected]> wrote:
> On 19/06/2020 13:50, Joerg Jung wrote:
>>
>>> On 18. Jun 2020, at 11:53, Wouter Wijngaards <[email protected]
>>> <mailto:[email protected]>
>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>> On 18/06/2020 11:49, Otto Moerbeek wrote:
>>>> On Thu, Jun 18, 2020 at 11:45:01AM +0200, Joerg Jung wrote:
>>>>>> On 18. Jun 2020, at 09:47, Otto Moerbeek <[email protected]
>>>>>> <mailto:[email protected]>
>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>> On Wed, Jun 17, 2020 at 04:57:54PM +0200, Joerg Jung wrote:
>>>>>>>> On 17. Feb 2020, at 15:16, Martin Pieuchot <[email protected]
>>>>>>>> <mailto:[email protected]>
>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>> On 17/02/20(Mon) 14:55, Joerg Jung wrote:
>>>>>>>>>
>>>>>>>>>> On 26. Sep 2019, at 15:02, Stuart Henderson
>>>>>>>>>> <[email protected] <mailto:[email protected]>
>>>>>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>>>>> On 2019/09/26 13:45, Stuart Henderson wrote:
>>>>>>>>>>> On 2019/09/26 11:16, Joerg Jung wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I run a few busy (~800 req/s) NSD servers which I upgraded
>>>>>>>>>>>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any
>>>>>>>>>>>> sysctl values and nsd.conf matches the default as well, just
>>>>>>>>>>>> added a few hundred zones.
>>>>>>>>>>>>
>>>>>>>>>>>> Now, when I increase servers from default 1 to 2 in nsd.conf:
>>>>>>>>>>>> server-count: 2
>>>>>>>>>>>> it starts spamming my log with:
>>>>>>>>>>>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily
>>>>>>>>>>>> unavailable
>>>>>>>>>>>>
>>>>>>>
>>>>>>> FYI, after upgrade to 6.7 the issue is still the same,
>>>>>>> just with different format for the syslog message not
>>>>>>> explicitly mentioning sendto() anymore, the new error
>>>>>>> log repeating gazillion of times looks like this:
>>>>>>> nsd[55919]: sendmmsg [0]=1.2.3.4 count=1 failed: Resource
>>>>>>> temporarily unavailable
>>>>>>>
>>>>>>>> Did you ktrace(1) the problem?
>>>>>>>
>>>>>>> Yes, please see below.
>>>>>>>
>>>>>>>> How is sendto(2) called, in particular
>>>>>>>> is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor?
>>>>>>>
>>>>>>> From the ktrace dump it seems not sendto() being the culprit, but
>>>>>>> recvfrom().
>>>>>>> Neither, sendto() nor recvfrom() seem to have any flags set.
>>>>>>
>>>>>> I think this is an error in your analysis. It's perfectly normal for
>>>>>> recvform to return EAGAIN.
>>>>>
>>>>> Yes, the log message coming from sendto() / sendmmsg() was
>>>>> misleading here.
>>>>>
>>>>>> On the sending side it indicates an issue.
>>>>>
>>>>> Yes, as verified with ktrace the sendto() works fine.
>>>>>
>>>>> However, it’s not normal to write 200 log messages per second for
>>>>> recvfrom() EAGAIN.
>>>>>
>>>>> So, seems like a (logging) bug in NSD: It looks like the sendmmsg()
>>>>> wrapper logs the earlier stored EAGAIN from resvmmsg() wrapper
>>>>> — if verbosity is 1, which is OpenBSD default.
>>>>>
>>>>> Besides, I believe (reading man pages) these network errors
>>>>> should be only logged at a higher verbosity 2 anyways.
>>>
>>> Commit
>>> https://github.com/NLnetLabs/nsd/commit/374a713b0d175dfe0127f05d986f5f9e1ca8c506
>>> should remove the EAGAIN log lines so they don't bother you.
>>
>> Thanks for looking into it.
>>
>>> At first I thought you were talking about the return value of the
>>> syscall, but instead it is one of the elements of the send list.
>>
>> There seems only 1 element in the send list (usually),
>> which is sent fully and successfully with no error from sendto()
>> and no EAGAIN being generated.
>>
>> So, I wonder where the EAGAIN is actually coming from:
>> Looking at the ktrace output below and the code it seems most
>> likely to be propagated from recvmmsg() to sendmmsg(), no?
>>
>> Wild guess: Is there an off-by-one in the while loop, which results
>> in sendmmsg() called again, which restores the previous errno
>> from recvmmsg() but does not hit sendto() again.
>>
>> So, your recent commit may just “hide" the actual “issue".
>
> You may be right, does this commit look better:
> https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104
>
> <https://github.com/NLnetLabs/nsd/commit/a74e23f4a4048b716b0a22b6d3a71dea7afcd104>
>
> It retries when UDP send buffers are full and sets the blocking flag so
> that the operation completes.
Hah. that’s a large hammer, which may work and seems the right thing
to do IFF EAGAIN is triggered from the sendto()/sendmmsg() due to full
buffers.
But from the ktrace the EAGAIN seems not come from sendto()/sendmmsg(),
since sendto() just finishes fine (see last line in ktrace below) -- so no
buffer
issues here. Actually, full ktrace contains only EAGAIN from recvfrom() and
none from sendto() (or any other syscall). As I wrote above, the EAGAIN
seems just “propagated” from recvmmsg() to sendmmsg() due to the errno
restore dance.
Why does the code postpone the error handling (and error logging) at all and
stores / restore errno for later iterations - why not instead handle the error
(logging)
immediately?
>>>>>>> 52438 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 55919 nsd STRU struct kevent { ident=5, filter=EVFILT_READ,
>>>>>>> flags=0x1<EV_ADD>, fflags=0<>, data=62
>>>>>>> , udata=0x30562fe4198 }
>>>>>>> 52438 nsd RET kevent 1
>>>>>>> 55919 nsd RET kevent 1
>>>>>>> 52438 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd STRU struct timespec { 1224.307942259 }
>>>>>>> 55919 nsd STRU struct timespec { 1224.307946954 }
>>>>>>> 52438 nsd RET clock_gettime 0
>>>>>>> 55919 nsd RET clock_gettime 0
>>>>>>> 52438 nsd CALL
>>>>>>> recvfrom(5,0x30467ecd000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 55919 nsd CALL
>>>>>>> recvfrom(5,0x30497a9d000,0x20109,0,0x304e553dc18,0x302669c2aa8)
>>>>>>> 52438 nsd GIO fd 5 read 46 bytes
>>>>>>>
>>>>>>> "\aY\0\^P\0\^A\0\0\0\0\0\^A\^Dwpad\^Efoo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 55919 nsd RET recvfrom -1 errno 35 Resource temporarily
>>>>>>> unavailable
>>>>>>> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd RET recvfrom 46/0x2e
>>>>>>> 55919 nsd STRU struct timespec { 1224.307967366 }
>>>>>>> 55919 nsd RET clock_gettime 0
>>>>>>> 52438 nsd CALL gettimeofday(0x7f7ffffbeed0,0)
>>>>>>> 55919 nsd CALL clock_gettime(CLOCK_MONOTONIC,0x7f7ffffbf1a0)
>>>>>>> 52438 nsd STRU struct timeval { 1592402542<"Jun 17 10:02:22
>>>>>>> 2020">.287233 }
>>>>>>> 55919 nsd STRU struct timespec { 1224.307977172 }
>>>>>>> 52438 nsd RET gettimeofday 0
>>>>>>> 55919 nsd RET clock_gettime 0
>>>>>>> 52438 nsd CALL sendto(5,0x30467ecd000,0x2e,0,0x304e553dc18,0x10)
>>>>>>> 55919 nsd CALL kevent(11,0,0,0x30549983000,64,0x7f7ffffbf0f8)
>>>>>>> 52438 nsd STRU struct sockaddr { AF_INET, 1.2.3.4:40276 }
>>>>>>> 55919 nsd STRU struct timespec { 109.487449000 }
>>>>>>> 52438 nsd GIO fd 5 wrote 46 bytes
>>>>>>>
>>>>>>> "\aY\M^@\^E\0\^A\0\0\0\0\0\^A\^Dwpad\^foo\^Fbar\0\0\^\\0\^A\0\0)\^P\0\0\0\M^@\0\0\0"
>>>>>>> 52438 nsd RET sendto 46/0x2e