On 10/21/2011 4:36 PM, Neil Horman wrote:
This oops was reported to me recently:
PID: 5176   TASK: ffff880215274100  CPU: 0   COMMAND: "fc_rport_eq"
0 [ffff880218c65760] machine_kexec at ffffffff81031d3b
1 [ffff880218c657c0] crash_kexec at ffffffff810b8e92
2 [ffff880218c65890] oops_end at ffffffff814ef890
3 [ffff880218c658c0] no_context at ffffffff8104226b
4 [ffff880218c65910] __bad_area_nosemaphore at ffffffff810424f5
5 [ffff880218c65960] bad_area_nosemaphore at ffffffff810425c3
6 [ffff880218c65970] __do_page_fault at ffffffff81042c9d
7 [ffff880218c65a90] do_page_fault at ffffffff814f186e
8 [ffff880218c65ac0] page_fault at ffffffff814eec25
9 [ffff880218c65bb8] fc_fcp_complete_locked at ffffffffa02ed739 [libfc]
10 [ffff880218c65c08] fc_fcp_retry_cmd at ffffffffa02ed86f [libfc]
11 [ffff880218c65c28] fc_fcp_recv at ffffffffa02eed3f [libfc]
12 [ffff880218c65d28] fc_exch_mgr_reset at ffffffffa02e2373 [libfc]
13 [ffff880218c65db8] fc_rport_work at ffffffffa02e9f10 [libfc]
14 [ffff880218c65e38] worker_thread at ffffffff8108b250
15 [ffff880218c65ee8] kthread at ffffffff81090806
16 [ffff880218c65f48] kernel_thread at ffffffff8100c10a

It results from two contexts that try to manipulate the same fcoe_exch_pool
without syncronizing themselves:

1) The fcoe event_work workqueue which calls
fc_rport_work
  fc_exch_mgr_reset
    fc_exch_pool_reset

2) The FCOE transport destroy path, which schedules a destroy_work workqueue,
calling:
fcoe_destroy_work
  fcoe_if_destroy
   fc_exch_mgr_free
    fc_exch_mgr_del
     fc_exch_mgr_destroy

The pool_reset path holds the pool look, but no references to the pool manager
kobject, while exch_mgr_destroy path drops what is ostensibly the last
reference to the pool manager kobject (causing its freeing), while not holding
the pool lock.

The attached patch has been confirmed to prevent the panic.

Signed-off-by: Neil Horman<nhor...@tuxdriver.com>
CC: Robert Love<robert.w.l...@intel.com>
---
  drivers/scsi/libfc/fc_exch.c |    2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index 8325489..d7c8100 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -1815,10 +1815,12 @@ void fc_exch_mgr_reset(struct fc_lport *lport, u32 sid, 
u32 did)
        unsigned int cpu;

        list_for_each_entry(ema,&lport->ema_list, ema_list) {
+               kref_get(&ema->mp->kref)

If the above contexts are not synchronized, we may be accessing illegal pointer in ema. Just before calling kref_get(), the ema may have been free'd in fc_exch_mgr_del(). We do not still have protection for it, right? am I missing something?


                for_each_possible_cpu(cpu)
                        fc_exch_pool_reset(lport,
                                           per_cpu_ptr(ema->mp->pool, cpu),
                                           sid, did);
+               kref_put(&ema->mp->kref, fc_exch_mgr_destroy);
        }
  }
  EXPORT_SYMBOL(fc_exch_mgr_reset);


_______________________________________________
devel mailing list
devel@open-fcoe.org
https://lists.open-fcoe.org/mailman/listinfo/devel

Reply via email to