I eventually got to the bottom of this problem. When an error is reported on a UDP socket, epoll signals EPOLLERR, which libevent delivers as EV_READ, but no data is available to read. The read fails with EAGAIN, the error stays pending, and the event fires again, so the handler spins in an endless loop under libevent.
The solution is to call recvmsg() on a socket in this condition with the MSG_ERRQUEUE flag. It reads the queued error message and returns the socket to its normal state. BSD-compatible systems do not behave this way; MSG_ERRQUEUE is a lesser-known Linux extension. I am still working on including this workaround in the TURN server code, but my pilot tests show that it works and that this is the right solution.

Thanks everybody!
Oleg

On Fri, Jul 19, 2013 at 8:34 PM, Oleg Moskalenko <mom040...@gmail.com> wrote:

> I tried to open/close the UDP socket. It helps. I suppose that this is a
> Linux UDP networking problem.
>
> Socket flag SO_BSDCOMPAT does not help.
>
> Reopening the socket is not very convenient, of course... but as this is
> an exceptional situation (file descriptors overload) then it may be OK.
>
> Thanks
> Oleg
>
> On Fri, Jul 19, 2013 at 8:25 PM, Oleg Moskalenko <mom040...@gmail.com> wrote:
>
>> OK, I checked that...
>>
>> The error returned by SO_ERROR is always 0.
>>
>> The socket is actually "alive": it would accept messages if sent to.
>>
>> I tried to change it to recvmsg. No changes... This is what I see from
>> strace:
>>
>> ..................................
>> [pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 314) = 1
>> [pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 717051324}) = 0
>> [pid 24100] gettimeofday({1374289798, 822877}, NULL) = 0
>> [pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 313) = 1
>> [pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 718103692}) = 0
>> [pid 24100] gettimeofday({1374289798, 823914}, NULL) = 0
>> ....................................
>>
>> Basically, the socket goes into a "gray" state - non-dead and
>> non-totally-alive.
>>
>> I wonder if I see the results of the "new" UDP Linux weird behavior (RFC
>> 1122) that many are complaining about, for example:
>>
>> http://web.mit.edu/Ghudson/info/linux.icmp
>>
>> I do not see anything like that in non-Linux *NIXes.
>>
>> Does it make any sense ? I am trying to figure out how it can be fixed at
>> all.
>>
>> Thanks
>> Oleg
>>
>> On Fri, Jul 19, 2013 at 8:28 AM, Nick Mathewson <ni...@freehaven.net> wrote:
>>
>>> On Fri, Jul 19, 2013 at 9:31 AM, Oleg Moskalenko <mom040...@gmail.com>
>>> wrote:
>>> > Thank you Azat for the suggestion. It seems to me that UDP sockets are
>>> > offenders, somehow it happens only in Linux (I know Linux has some weird UDP
>>> > behavior):
>>> >
>>> > Process 20828 attached with 5 threads - interrupt to quit
>>> > [pid 20831] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
>>> > [pid 20832] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
>>> > [pid 20831] <... clock_gettime resumed> {205614, 271115090}) = 0
>>> > [pid 20831] gettimeofday( <unfinished ...>
>>> > [pid 20832] <... clock_gettime resumed> {205614, 271926086}) = 0
>>> > [pid 20831] <... gettimeofday resumed> {1374240484, 377784}, NULL) = 0
>>> > [pid 20832] gettimeofday( <unfinished ...>
>>> > [pid 20831] epoll_wait(20, <unfinished ...>
>>> > [pid 20829] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
>>> > [pid 20830] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
>>> > [pid 20832] <... gettimeofday resumed> {1374240484, 378418}, NULL) = 0
>>> > [pid 20832] epoll_wait(16, <unfinished ...>
>>> > [pid 20830] <... clock_gettime resumed> {205614, 273231001}) = 0
>>> > [pid 20829] <... clock_gettime resumed> {205614, 272801617}) = 0
>>> > [pid 20829] gettimeofday( <unfinished ...>
>>> > [pid 20830] gettimeofday( <unfinished ...>
>>> > [pid 20829] <... gettimeofday resumed> {1374240484, 379094}, NULL) = 0
>>> > [pid 20829] epoll_wait(28, <unfinished ...>
>>> > [pid 20830] <... gettimeofday resumed> {1374240484, 379317}, NULL) = 0
>>> > [pid 20830] epoll_wait(24, <unfinished ...>
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1 EAGAIN (Resource temporarily unavailable)
>>> > [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 19) = 1
>>> > [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 277088474}) = 0
>>> > [pid 20828] gettimeofday({1374240484, 386338}, NULL) = 0
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1 EAGAIN (Resource temporarily unavailable)
>>> > [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 12) = 1
>>> > [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 286419826}) = 0
>>> > [pid 20828] gettimeofday({1374240484, 392232}, NULL) = 0
>>> > [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1
>>>
>>> Hm. So, epoll_wait is reporting EPOLLERR on fd 8. The Libevent
>>> epoll.c code treats EPOLLERR as (EV_READ|EV_WRITE). But when you
>>> recvfrom on the socket, it only says EAGAIN.
>>>
>>> So your program sensibly decides to keep listening for events on fd 8,
>>> and epoll keeps telling you that there was an error.
>>>
>>> Assuming that this recvfrom is in your code, I'll echo Vsevolod's
>>> question: what happens when you call getsockopt(...SO_ERROR...) on
>>> the socket in the event handler that calls the recvfrom, to see what
>>> the queued error is?
>>>
>>> --
>>> Nick
>>> ***********************************************************************
>>> To unsubscribe, send an e-mail to majord...@freehaven.net with
>>> unsubscribe libevent-users in the body.
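
P.S. For anyone who hits the same loop, here is a minimal sketch of the MSG_ERRQUEUE workaround described at the top of this message. It is Linux-only and is not the actual TURN server code. The helper name drain_error_queue and the buffer sizes are my own assumptions for illustration. I also assume that the socket has IP_RECVERR (or IPV6_RECVERR) enabled, since that is what makes Linux queue errors on a datagram socket in the first place.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Hypothetical helper: read and discard everything in the Linux
 * per-socket error queue of fd.
 * Returns the number of queued errors that were read, or -1 if
 * recvmsg() failed for a reason other than "queue is empty". */
int drain_error_queue(int fd)
{
    int drained = 0;

    for (;;) {
        char data[512];                  /* payload of the offending datagram */
        union {
            char buf[512];               /* ancillary data describing the error */
            struct cmsghdr align;        /* forces correct alignment of buf */
        } control;
        struct sockaddr_storage peer;
        struct iovec iov;
        struct msghdr msg;
        struct cmsghdr *c;
        ssize_t n;

        iov.iov_base = data;
        iov.iov_len = sizeof(data);

        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &peer;
        msg.msg_namelen = sizeof(peer);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        /* MSG_ERRQUEUE reads from the error queue instead of the
         * normal receive queue; MSG_DONTWAIT keeps the call non-blocking. */
        n = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return drained;          /* error queue is now empty */
            return -1;                   /* e.g. EBADF */
        }

        /* Log the extended error, if one is attached. */
        for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
            if ((c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_RECVERR) ||
                (c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_RECVERR)) {
                struct sock_extended_err ee;
                memcpy(&ee, CMSG_DATA(c), sizeof(ee));
                fprintf(stderr, "fd %d: queued error %u (%s)\n",
                        fd, (unsigned)ee.ee_errno, strerror((int)ee.ee_errno));
            }
        }
        drained++;
    }
}
```

The idea is to call this from the read callback when the normal recvfrom() returns EAGAIN even though the event fired. Once the queue is empty, epoll should stop reporting EPOLLERR for that descriptor.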