I drilled down, eventually, to  the bottom of this problem.

When some kind of socket problem is reported on a UDP socket, then the
EV_READ is generated but  there is no data is available. So it will go in
indefinite cycle with libevent.

The solution is to use recvmsg on a socket in this condition, with flag
MSG_ERRQUEUE. It will read the error message and restore the socket
condition.

BSD-compatible systems do not do that, this is obviously a Linux
lesser-known extension.

I am working on including this workaround into the TURN server code, but my
pilot code tests shows that it works perfectly and this is the right
solution.

Thanks everybody !
Oleg



On Fri, Jul 19, 2013 at 8:34 PM, Oleg Moskalenko <mom040...@gmail.com>wrote:

> I tried to open/close the UDP socket. It helps. I suppose that this is a
> Linux UDP networking problem.
>
> Socket flag SO_BSDCOMPAT does not help.
>
> Reopening the socket is not very convenient, of course... but as this is
> an exceptional situation (file descriptors overload) then it may be OK.
>
> Thanks
> Oleg
>
>
>
> On Fri, Jul 19, 2013 at 8:25 PM, Oleg Moskalenko <mom040...@gmail.com>wrote:
>
>> OK, I checked that...
>>
>> The error returned by SO_ERROR is always 0.
>>
>> The socket is actually "alive": it would accept messages if sent to.
>>
>> I tried to change it to recvmsg. No changes... This is what I see from
>> strace:
>>
>> ..................................
>> [pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> [pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 314) = 1
>> [pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 717051324}) = 0
>> [pid 24100] gettimeofday({1374289798, 822877}, NULL) = 0
>> [pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> [pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 313) = 1
>> [pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 718103692}) = 0
>> [pid 24100] gettimeofday({1374289798, 823914}, NULL) = 0
>> ....................................
>>
>> Basically, the socket goes into a "gray" state - non-dead and
>> non-totally-alive.
>>
>> I wonder if I see the results of the "new" UDP Linux weird behavior (RFC
>> 1122) that many are complaining about, for example:
>>
>> http://web.mit.edu/Ghudson/info/linux.icmp
>>
>> I do not see anything like that in non-Linux *NIXes.
>>
>> Does it make any sense ? I am trying to figure out how it can be fixed at
>> all.
>>
>> Thanks
>> Oleg
>>
>>
>>
>>
>>
>> On Fri, Jul 19, 2013 at 8:28 AM, Nick Mathewson <ni...@freehaven.net>wrote:
>>
>>> On Fri, Jul 19, 2013 at 9:31 AM, Oleg Moskalenko <mom040...@gmail.com>
>>> wrote:
>>> > Thank you Azat for the suggestion. It seems to me that UDP sockets are
>>> > offenders, somehow it happens only in Linux (I know Linux has some
>>> weird UDP
>>> > behavior):
>>> >
>>> > Process 20828 attached with 5 threads - interrupt to quit
>>> > [pid 20831] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
>>> > [pid 20832] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
>>> > [pid 20831] <... clock_gettime resumed> {205614, 271115090}) = 0
>>> > [pid 20831] gettimeofday( <unfinished ...>
>>> > [pid 20832] <... clock_gettime resumed> {205614, 271926086}) = 0
>>> > [pid 20831] <... gettimeofday resumed> {1374240484, 377784}, NULL) = 0
>>> > [pid 20832] gettimeofday( <unfinished ...>
>>> > [pid 20831] epoll_wait(20,  <unfinished ...>
>>> > [pid 20829] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
>>> > [pid 20830] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
>>> > [pid 20832] <... gettimeofday resumed> {1374240484, 378418}, NULL) = 0
>>> > [pid 20832] epoll_wait(16,  <unfinished ...>
>>> > [pid 20830] <... clock_gettime resumed> {205614, 273231001}) = 0
>>> > [pid 20829] <... clock_gettime resumed> {205614, 272801617}) = 0
>>> > [pid 20829] gettimeofday( <unfinished ...>
>>> > [pid 20830] gettimeofday( <unfinished ...>
>>> > [pid 20829] <... gettimeofday resumed> {1374240484, 379094}, NULL) = 0
>>> > [pid 20829] epoll_wait(28,  <unfinished ...>
>>> > [pid 20830] <... gettimeofday resumed> {1374240484, 379317}, NULL) = 0
>>> > [pid 20830] epoll_wait(24,  <unfinished ...>
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20,
>>> 0x7fff61df20bc) = -1
>>> > EAGAIN (Resource temporarily unavailable)
>>> > [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 19) = 1
>>> > [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 277088474}) = 0
>>> > [pid 20828] gettimeofday({1374240484, 386338}, NULL) = 0
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20,
>>> 0x7fff61df20bc) = -1
>>> > EAGAIN (Resource temporarily unavailable)
>>> > [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 12) = 1
>>> > [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 286419826}) = 0
>>> > [pid 20828] gettimeofday({1374240484, 392232}, NULL) = 0
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20,
>>> 0x7fff61df20bc) = -1
>>>
>>> Hm. So, epoll_wait is reporting EPOLLERR on fd 8.  The Libevent
>>> epoll.c code treats EPOLLERR as (EV_READ|EV_WRITE).  But when you
>>> recvfrom on the socket, it only says EAGAIN.
>>>
>>> So your program sensibly decides to keep listening for events on fd 8,
>>> and epoll keeps telling you that there was an error.
>>>
>>> Assuming that this recvfrom is in your code, I'll echo Vsevolod's
>>> question: what happens when you call getsockopt(...SO_ERROR...)  on
>>> the socket in the event handler that calls the recvfrom, to see what
>>> the queued error is?
>>>
>>> --
>>> Nick
>>> ***********************************************************************
>>> To unsubscribe, send an e-mail to majord...@freehaven.net with
>>> unsubscribe libevent-users    in the body.
>>>
>>
>>
>

Reply via email to