Currently rtnl mutex is grabbed during fcoe create, destroy, enable and disable operations while sysfs s_active read mutex is already held, but simultaneously other networking events could try grabbing write s_active mutex while rtnl is already held and that would cause deadlock as reported by circular warning log pasted below.
In this log, the rtnl was held before write s_active during device renaming but there are more such cases as Joe reported another instance with tg3 open at:- http://www.open-fcoe.org/pipermail/devel/2010-February/008263.html This patch fixes this issue by not waiting for rtnl lock during fcoe create, destroy, enable and disable operations, thus in case rtnl lock is not immediately available then return -EAGAIN error to make s_active available again from fcoe ops, so that others waiting in line could grab s_active to finish their work. Caller can re-try same fcoe operation on -EAGAIN error (TBD). Though LOCKDEP logs helped in identifying above described deadlock between rtnl and s_active mutex but still not sure why LOCKDEP is printing error bit differently with fcoe_config_mutex lock attempt while s_active held. I don't see any way fcoe_config_mutex is issue here beside grabbed between s_active and rtnl during only fcoe operations. Am I missing any other possible deadlock with fcoe_config_mutex ? or simply this is the way LOCKDEP works since anyway its log printed all locks held with possible circular locking. I considered removing rtnl_lock dependency completely on s_active for fcoe ops but that path went down to too many existing code changes beside need for another new lock to handle NETDEV_UNREGISTER case, so I think added -EAGAIN and releasing s_active if rtnl is not available seems better fix to me. Any comments ? TODO:- 1. More testing since very tough to hit this case. 2. Update apps to handle -EAGAIN error code. -- ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.33.1linux-stable-2.6.33 #1 ------------------------------------------------------- fcoemon/18823 is trying to acquire lock: (fcoe_config_mutex){+.+.+.}, at: [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7 [fcoe] but task is already holding lock: (s_active){++++.+}, at: [<ffffffff8115ef93>] sysfs_get_active_two+0x31/0x48 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (s_active){++++.+}: [<ffffffff81077bdb>] __lock_acquire+0xb73/0xd2b [<ffffffff81077e60>] lock_acquire+0xcd/0xf1 [<ffffffff8115e5df>] sysfs_deactivate+0x8b/0xe0 [<ffffffff8115edfb>] sysfs_addrm_finish+0x36/0x55 [<ffffffff8115d0cc>] sysfs_hash_and_remove+0x53/0x6a [<ffffffff8115f353>] sysfs_remove_link+0x21/0x23 [<ffffffff812b6c93>] device_rename+0x99/0xcb [<ffffffff8138dbf0>] dev_change_name+0xd5/0x1d2 [<ffffffff8138deee>] dev_ifsioc+0x201/0x2ac [<ffffffff8138e4ba>] dev_ioctl+0x521/0x632 [<ffffffff81379e43>] sock_do_ioctl+0x3d/0x47 [<ffffffff8137a254>] sock_ioctl+0x213/0x222 [<ffffffff81114614>] vfs_ioctl+0x32/0xa6 [<ffffffff81114b94>] do_vfs_ioctl+0x490/0x4d6 [<ffffffff81114c30>] sys_ioctl+0x56/0x79 [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b -> #1 (rtnl_mutex){+.+.+.}: [<ffffffff81077bdb>] __lock_acquire+0xb73/0xd2b [<ffffffff81077e60>] lock_acquire+0xcd/0xf1 [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383 [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43 [<ffffffff813959f9>] rtnl_lock+0x17/0x19 [<ffffffff8138ccae>] register_netdevice_notifier+0x1e/0x19b [<ffffffffa02580c1>] 0xffffffffa02580c1 [<ffffffff81002069>] do_one_initcall+0x5e/0x15e [<ffffffff81084094>] sys_init_module+0xd8/0x23a [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b -> #0 (fcoe_config_mutex){+.+.+.}: [<ffffffff81077a85>] __lock_acquire+0xa1d/0xd2b [<ffffffff81077e60>] lock_acquire+0xcd/0xf1 [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383 [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43 [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7 [fcoe] [<ffffffff810635b1>] param_attr_store+0x27/0x35 [<ffffffff81063619>] module_attr_store+0x26/0x2a [<ffffffff8115dae3>] sysfs_write_file+0x108/0x144 [<ffffffff81107bd1>] vfs_write+0xae/0x10b [<ffffffff81107cee>] sys_write+0x4a/0x6e [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b other info that might help us debug this: 3 locks held by fcoemon/18823: #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8115da17>] sysfs_write_file+0x3c/0x144 #1: (s_active){++++.+}, at: [<ffffffff8115ef86>] sysfs_get_active_two+0x24/0x48 #2: (s_active){++++.+}, at: [<ffffffff8115ef93>] sysfs_get_active_two+0x31/0x48 stack backtrace: Pid: 18823, comm: fcoemon Tainted: G W 2.6.33.1linux-stable-2.6.33 #1 Call Trace: [<ffffffff81076c38>] print_circular_bug+0xa8/0xb6 [<ffffffff81077a85>] __lock_acquire+0xa1d/0xd2b [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe] [<ffffffff81077e60>] lock_acquire+0xcd/0xf1 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe] [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe] [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe] [<ffffffff8106ac70>] ? cpu_clock+0x43/0x5e [<ffffffff81074e12>] ? lockstat_clock+0x11/0x13 [<ffffffff81074e40>] ? lock_release_holdtime+0x2c/0x127 [<ffffffff8115ef93>] ? sysfs_get_active_two+0x31/0x48 [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43 [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7 [fcoe] [<ffffffff810635b1>] param_attr_store+0x27/0x35 [<ffffffff81063619>] module_attr_store+0x26/0x2a [<ffffffff8115dae3>] sysfs_write_file+0x108/0x144 [<ffffffff81107bd1>] vfs_write+0xae/0x10b [<ffffffff81076596>] ? trace_hardirqs_on_caller+0x125/0x150 [<ffffffff81107cee>] sys_write+0x4a/0x6e [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b Signed-off-by: Vasu Dev <[email protected]> --- drivers/scsi/fcoe/fcoe.c | 32 ++++++++++++++++++++++++++------ 1 files changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c index 7022a16..51bba46 100644 --- a/drivers/scsi/fcoe/fcoe.c +++ b/drivers/scsi/fcoe/fcoe.c @@ -798,6 +798,8 @@ skip_oem: /** * fcoe_if_destroy() - Tear down a SW FCoE instance * @lport: The local port to be destroyed + * + * Locking: must be called with the RTNL mutex held */ static void fcoe_if_destroy(struct fc_lport *lport) { @@ -820,7 +822,6 @@ static void fcoe_if_destroy(struct fc_lport *lport) /* Free existing transmit skbs */ fcoe_clean_pending_queue(lport); - rtnl_lock(); if (!is_zero_ether_addr(port->data_src_addr)) dev_unicast_delete(netdev, port->data_src_addr); rtnl_unlock(); @@ -1896,7 +1897,11 @@ static int fcoe_disable(const char *buffer, struct kernel_param *kp) goto out_nodev; } - rtnl_lock(); + if (!rtnl_trylock()) { + rc = -EAGAIN; + goto out_putdev; + } + fcoe = fcoe_hostlist_lookup_port(netdev); rtnl_unlock(); @@ -1906,6 +1911,7 @@ static int fcoe_disable(const char *buffer, struct kernel_param *kp) } else rc = -ENODEV; +out_putdev: dev_put(netdev); out_nodev: mutex_unlock(&fcoe_config_mutex); @@ -1946,7 +1952,11 @@ static int fcoe_enable(const char *buffer, struct kernel_param *kp) goto out_nodev; } - rtnl_lock(); + if (!rtnl_trylock()) { + rc = -EAGAIN; + goto out_putdev; + } + fcoe = fcoe_hostlist_lookup_port(netdev); rtnl_unlock(); @@ -1957,6 +1967,7 @@ static int fcoe_enable(const char *buffer, struct kernel_param *kp) } else rc = -ENODEV; +out_putdev: dev_put(netdev); out_nodev: mutex_unlock(&fcoe_config_mutex); @@ -1997,7 +2008,11 @@ static int fcoe_destroy(const char *buffer, struct kernel_param *kp) goto out_nodev; } - rtnl_lock(); + if (!rtnl_trylock()) { + rc = -EAGAIN; + goto out_putdev; + } + fcoe = fcoe_hostlist_lookup_port(netdev); if (!fcoe) { rtnl_unlock(); @@ -2006,7 +2021,6 @@ static int fcoe_destroy(const char *buffer, struct kernel_param *kp) } list_del(&fcoe->list); fcoe_interface_cleanup(fcoe); - rtnl_unlock(); fcoe_if_destroy(fcoe->ctlr.lp); module_put(THIS_MODULE); @@ -2027,6 +2041,7 @@ static void fcoe_destroy_work(struct work_struct *work) port = container_of(work, struct fcoe_port, destroy_work); mutex_lock(&fcoe_config_mutex); + rtnl_lock(); fcoe_if_destroy(port->lport); mutex_unlock(&fcoe_config_mutex); } @@ -2065,7 +2080,11 @@ static int fcoe_create(const char *buffer, struct kernel_param *kp) goto out_nomod; } - rtnl_lock(); + if (!rtnl_trylock()) { + rc = -EAGAIN; + goto out_mod; + } + netdev = fcoe_if_to_netdev(buffer); if (!netdev) { rc = -ENODEV; @@ -2121,6 +2140,7 @@ out_putdev: dev_put(netdev); out_nodev: rtnl_unlock(); +out_mod: module_put(THIS_MODULE); out_nomod: mutex_unlock(&fcoe_config_mutex); _______________________________________________ devel mailing list [email protected] http://www.open-fcoe.org/mailman/listinfo/devel
