Currently rtnl mutex is grabbed during fcoe create, destroy, enable
and disable operations while sysfs s_active read mutex is already
held, but simultaneously other networking events could try grabbing
write s_active mutex while rtnl is already held and that would cause
deadlock as reported by circular warning log pasted below.

In this log, the rtnl was held before write s_active during device
renaming but there are more such cases as Joe reported another
instance with tg3 open at:-
 http://www.open-fcoe.org/pipermail/devel/2010-February/008263.html

This patch fixes this issue by not waiting for rtnl lock during
fcoe create, destroy, enable and disable operations, thus in case
rtnl lock is not immediately available then return -EAGAIN error
to make s_active available again from fcoe ops, so that others
waiting in line could grab s_active to finish their work.

Caller can re-try same fcoe operation on -EAGAIN error (TBD).

Though LOCKDEP logs helped in identifying above described deadlock
between rtnl and s_active mutex but still not sure why LOCKDEP is
printing error bit differently with fcoe_config_mutex lock attempt
while s_active held. I don't see any way fcoe_config_mutex is
issue here beside grabbed between s_active and rtnl during only
fcoe operations. Am I missing any other possible deadlock with
fcoe_config_mutex ?  or simply this is the way LOCKDEP works
since anyway its log printed all locks held with possible circular
locking.

I considered removing rtnl_lock dependency completely on s_active
for fcoe ops but that path went down to too many existing code
changes beside need for another new lock to handle NETDEV_UNREGISTER
case, so I think added -EAGAIN and releasing s_active if rtnl is not
available seems better fix to me.

Any comments ?

TODO:-
1. More testing since very tough to hit this case.
2. Update apps to handle -EAGAIN error code.

--

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.33.1linux-stable-2.6.33 #1
-------------------------------------------------------
fcoemon/18823 is trying to acquire lock:
 (fcoe_config_mutex){+.+.+.}, at: [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7
[fcoe]

but task is already holding lock:
 (s_active){++++.+}, at: [<ffffffff8115ef93>] sysfs_get_active_two+0x31/0x48

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (s_active){++++.+}:
       [<ffffffff81077bdb>] __lock_acquire+0xb73/0xd2b
       [<ffffffff81077e60>] lock_acquire+0xcd/0xf1
       [<ffffffff8115e5df>] sysfs_deactivate+0x8b/0xe0
       [<ffffffff8115edfb>] sysfs_addrm_finish+0x36/0x55
       [<ffffffff8115d0cc>] sysfs_hash_and_remove+0x53/0x6a
       [<ffffffff8115f353>] sysfs_remove_link+0x21/0x23
       [<ffffffff812b6c93>] device_rename+0x99/0xcb
       [<ffffffff8138dbf0>] dev_change_name+0xd5/0x1d2
       [<ffffffff8138deee>] dev_ifsioc+0x201/0x2ac
       [<ffffffff8138e4ba>] dev_ioctl+0x521/0x632
       [<ffffffff81379e43>] sock_do_ioctl+0x3d/0x47
       [<ffffffff8137a254>] sock_ioctl+0x213/0x222
       [<ffffffff81114614>] vfs_ioctl+0x32/0xa6
       [<ffffffff81114b94>] do_vfs_ioctl+0x490/0x4d6
       [<ffffffff81114c30>] sys_ioctl+0x56/0x79
       [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b

-> #1 (rtnl_mutex){+.+.+.}:
       [<ffffffff81077bdb>] __lock_acquire+0xb73/0xd2b
       [<ffffffff81077e60>] lock_acquire+0xcd/0xf1
       [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383
       [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43
       [<ffffffff813959f9>] rtnl_lock+0x17/0x19
       [<ffffffff8138ccae>] register_netdevice_notifier+0x1e/0x19b
       [<ffffffffa02580c1>] 0xffffffffa02580c1
       [<ffffffff81002069>] do_one_initcall+0x5e/0x15e
       [<ffffffff81084094>] sys_init_module+0xd8/0x23a
       [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b

-> #0 (fcoe_config_mutex){+.+.+.}:
       [<ffffffff81077a85>] __lock_acquire+0xa1d/0xd2b
       [<ffffffff81077e60>] lock_acquire+0xcd/0xf1
       [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383
       [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43
       [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7 [fcoe]
       [<ffffffff810635b1>] param_attr_store+0x27/0x35
       [<ffffffff81063619>] module_attr_store+0x26/0x2a
       [<ffffffff8115dae3>] sysfs_write_file+0x108/0x144
       [<ffffffff81107bd1>] vfs_write+0xae/0x10b
       [<ffffffff81107cee>] sys_write+0x4a/0x6e
       [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

3 locks held by fcoemon/18823:
 #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff8115da17>]
sysfs_write_file+0x3c/0x144
 #1:  (s_active){++++.+}, at: [<ffffffff8115ef86>]
sysfs_get_active_two+0x24/0x48
 #2:  (s_active){++++.+}, at: [<ffffffff8115ef93>]
sysfs_get_active_two+0x31/0x48

stack backtrace:
Pid: 18823, comm: fcoemon Tainted: G        W  2.6.33.1linux-stable-2.6.33 #1
Call Trace:
 [<ffffffff81076c38>] print_circular_bug+0xa8/0xb6
 [<ffffffff81077a85>] __lock_acquire+0xa1d/0xd2b
 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe]
 [<ffffffff81077e60>] lock_acquire+0xcd/0xf1
 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe]
 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe]
 [<ffffffff8142f343>] __mutex_lock_common+0x4b/0x383
 [<ffffffffa02ba5fc>] ? fcoe_create+0x27/0x4f7 [fcoe]
 [<ffffffff8106ac70>] ? cpu_clock+0x43/0x5e
 [<ffffffff81074e12>] ? lockstat_clock+0x11/0x13
 [<ffffffff81074e40>] ? lock_release_holdtime+0x2c/0x127
 [<ffffffff8115ef93>] ? sysfs_get_active_two+0x31/0x48
 [<ffffffff8142f73f>] mutex_lock_nested+0x3e/0x43
 [<ffffffffa02ba5fc>] fcoe_create+0x27/0x4f7 [fcoe]
 [<ffffffff810635b1>] param_attr_store+0x27/0x35
 [<ffffffff81063619>] module_attr_store+0x26/0x2a
 [<ffffffff8115dae3>] sysfs_write_file+0x108/0x144
 [<ffffffff81107bd1>] vfs_write+0xae/0x10b
 [<ffffffff81076596>] ? trace_hardirqs_on_caller+0x125/0x150
 [<ffffffff81107cee>] sys_write+0x4a/0x6e
 [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b

Signed-off-by: Vasu Dev <[email protected]>
---

 drivers/scsi/fcoe/fcoe.c |   32 ++++++++++++++++++++++++++------
 1 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 7022a16..51bba46 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -798,6 +798,8 @@ skip_oem:
 /**
  * fcoe_if_destroy() - Tear down a SW FCoE instance
  * @lport: The local port to be destroyed
+ *
+ * Locking: must be called with the RTNL mutex held
  */
 static void fcoe_if_destroy(struct fc_lport *lport)
 {
@@ -820,7 +822,6 @@ static void fcoe_if_destroy(struct fc_lport *lport)
        /* Free existing transmit skbs */
        fcoe_clean_pending_queue(lport);
 
-       rtnl_lock();
        if (!is_zero_ether_addr(port->data_src_addr))
                dev_unicast_delete(netdev, port->data_src_addr);
        rtnl_unlock();
@@ -1896,7 +1897,11 @@ static int fcoe_disable(const char *buffer, struct 
kernel_param *kp)
                goto out_nodev;
        }
 
-       rtnl_lock();
+       if (!rtnl_trylock()) {
+               rc = -EAGAIN;
+               goto out_putdev;
+       }
+
        fcoe = fcoe_hostlist_lookup_port(netdev);
        rtnl_unlock();
 
@@ -1906,6 +1911,7 @@ static int fcoe_disable(const char *buffer, struct 
kernel_param *kp)
        } else
                rc = -ENODEV;
 
+out_putdev:
        dev_put(netdev);
 out_nodev:
        mutex_unlock(&fcoe_config_mutex);
@@ -1946,7 +1952,11 @@ static int fcoe_enable(const char *buffer, struct 
kernel_param *kp)
                goto out_nodev;
        }
 
-       rtnl_lock();
+       if (!rtnl_trylock()) {
+               rc = -EAGAIN;
+               goto out_putdev;
+       }
+
        fcoe = fcoe_hostlist_lookup_port(netdev);
        rtnl_unlock();
 
@@ -1957,6 +1967,7 @@ static int fcoe_enable(const char *buffer, struct 
kernel_param *kp)
        } else
                rc = -ENODEV;
 
+out_putdev:
        dev_put(netdev);
 out_nodev:
        mutex_unlock(&fcoe_config_mutex);
@@ -1997,7 +2008,11 @@ static int fcoe_destroy(const char *buffer, struct 
kernel_param *kp)
                goto out_nodev;
        }
 
-       rtnl_lock();
+       if (!rtnl_trylock()) {
+               rc = -EAGAIN;
+               goto out_putdev;
+       }
+
        fcoe = fcoe_hostlist_lookup_port(netdev);
        if (!fcoe) {
                rtnl_unlock();
@@ -2006,7 +2021,6 @@ static int fcoe_destroy(const char *buffer, struct 
kernel_param *kp)
        }
        list_del(&fcoe->list);
        fcoe_interface_cleanup(fcoe);
-       rtnl_unlock();
        fcoe_if_destroy(fcoe->ctlr.lp);
        module_put(THIS_MODULE);
 
@@ -2027,6 +2041,7 @@ static void fcoe_destroy_work(struct work_struct *work)
 
        port = container_of(work, struct fcoe_port, destroy_work);
        mutex_lock(&fcoe_config_mutex);
+       rtnl_lock();
        fcoe_if_destroy(port->lport);
        mutex_unlock(&fcoe_config_mutex);
 }
@@ -2065,7 +2080,11 @@ static int fcoe_create(const char *buffer, struct 
kernel_param *kp)
                goto out_nomod;
        }
 
-       rtnl_lock();
+       if (!rtnl_trylock()) {
+               rc = -EAGAIN;
+               goto out_mod;
+       }
+
        netdev = fcoe_if_to_netdev(buffer);
        if (!netdev) {
                rc = -ENODEV;
@@ -2121,6 +2140,7 @@ out_putdev:
        dev_put(netdev);
 out_nodev:
        rtnl_unlock();
+out_mod:
        module_put(THIS_MODULE);
 out_nomod:
        mutex_unlock(&fcoe_config_mutex);

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Reply via email to