This patchset follows up on the root-cause mentioned in
https://www.spinics.net/lists/netdev/msg472849.html

Patch1 implements some code refactoring that was suggested
as an enhancement in http://patchwork.ozlabs.org/patch/843157/
It replaces the c_destroy_in_prog bit in rds_connection with
an atomically managed flag in rds_conn_path.

Patch2 builds on Patch1 and uses RCU to make sure that work is only
enqueued if the connection destroy is not already in progress:
the test-flag-and-enqueue is done under rcu_read_lock, while destroy
first sets the flag, uses synchronize_rcu to wait for existing reader
threads to complete, and then starts all the work-cancellation.

Since I have not been able to reproduce the original stack traces
reported by syzbot, and these are fixes for a race condition that
are based on code-inspection, I am not marking these as reported-by
at this time.

Sowmini Varadhan (2):
  rds: Use atomic flag to track connections being destroyed
  rds: Ensure that send/recv/reconnect work cannot be requeued from softirq or proc context

 net/rds/cong.c        | 10 +++++++---
 net/rds/connection.c  | 24 +++++++++++++++++++-----
 net/rds/rds.h         |  4 ++--
 net/rds/send.c        | 37 ++++++++++++++++++++++++++++++++-----
 net/rds/tcp_connect.c |  2 +-
 net/rds/tcp_recv.c    |  8 ++++++--
 net/rds/tcp_send.c    |  5 ++++-
 net/rds/threads.c     | 20 +++++++++++++++-----
 8 files changed, 86 insertions(+), 24 deletions(-)
