Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is suffcient memory available to continue making progress. The problem is that __sk_mem_schedule() can return 0, under memory pressure without having set the SOCK_NOSPACE flag. Thus, even though all the outstanding packets have been acked, we never get the EPOLLOUT that we are expecting from epoll_wait().
This issue is currently limited to epoll when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. So I view this patch as bringing us into sync with poll()/select() and epoll() level trigger behavior. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is hard to hit (and did not hit in my testing), in that over the 'soft' limit, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case...note that this case is not specific to epoll ET, but rather would affect all blocking and non-blocking sockets as well, and thus I think its ok to treat it as a separate case. Signed-off-by: Jason Baron <jba...@akamai.com> --- net/core/stream.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/core/stream.c b/net/core/stream.c index 301c05f..d70f77a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -119,6 +119,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p) int err = 0; long vm_wait = 0; long current_timeo = *timeo_p; + bool noblock = (*timeo_p ? false : true); DEFINE_WAIT(wait); if (sk_stream_memory_free(sk)) @@ -131,8 +132,11 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p) if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) goto do_error; - if (!*timeo_p) + if (!*timeo_p) { + if (noblock) + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); goto do_nonblock; + } if (signal_pending(current)) goto do_interrupted; clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); -- 1.8.2.rc2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html