On 10/24/2011 5:22 PM, Vasu Dev wrote:
On Mon, 2011-10-24 at 14:36 -0700, Bhanu Prakash Gollapudi wrote:
On 10/24/2011 1:42 PM, Neil Horman wrote:
On Mon, Oct 24, 2011 at 11:59:52AM -0700, Zou, Yi wrote:


On 10/21/2011 4:36 PM, Neil Horman wrote:
This oops was reported to me recently:
PID: 5176   TASK: ffff880215274100  CPU: 0   COMMAND: "fc_rport_eq"
0 [ffff880218c65760] machine_kexec at ffffffff81031d3b
1 [ffff880218c657c0] crash_kexec at ffffffff810b8e92
2 [ffff880218c65890] oops_end at ffffffff814ef890
3 [ffff880218c658c0] no_context at ffffffff8104226b
4 [ffff880218c65910] __bad_area_nosemaphore at ffffffff810424f5
5 [ffff880218c65960] bad_area_nosemaphore at ffffffff810425c3
6 [ffff880218c65970] __do_page_fault at ffffffff81042c9d
7 [ffff880218c65a90] do_page_fault at ffffffff814f186e
8 [ffff880218c65ac0] page_fault at ffffffff814eec25
9 [ffff880218c65bb8] fc_fcp_complete_locked at ffffffffa02ed739 [libfc]
10 [ffff880218c65c08] fc_fcp_retry_cmd at ffffffffa02ed86f [libfc]
11 [ffff880218c65c28] fc_fcp_recv at ffffffffa02eed3f [libfc]
12 [ffff880218c65d28] fc_exch_mgr_reset at ffffffffa02e2373 [libfc]
13 [ffff880218c65db8] fc_rport_work at ffffffffa02e9f10 [libfc]
14 [ffff880218c65e38] worker_thread at ffffffff8108b250
15 [ffff880218c65ee8] kthread at ffffffff81090806
16 [ffff880218c65f48] kernel_thread at ffffffff8100c10a

It results from two contexts that manipulate the same fcoe_exch_pool
without synchronizing with each other:

1) The fcoe event_work workqueue which calls
fc_rport_work
    fc_exch_mgr_reset
      fc_exch_pool_reset

2) The FCoE transport destroy path, which schedules a destroy_work
workqueue item, calling:
fcoe_destroy_work
    fcoe_if_destroy
     fc_exch_mgr_free
      fc_exch_mgr_del
       fc_exch_mgr_destroy

The pool_reset path holds the pool lock, but takes no reference on the
pool manager kobject, while the exch_mgr_destroy path drops what is
ostensibly the last reference to the pool manager kobject (causing it to
be freed) without holding the pool lock.

The attached patch has been confirmed to prevent the panic.

Signed-off-by: Neil Horman <nhor...@tuxdriver.com>
CC: Robert Love <robert.w.l...@intel.com>
---
    drivers/scsi/libfc/fc_exch.c |    2 ++
    1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index 8325489..d7c8100 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -1815,10 +1815,12 @@ void fc_exch_mgr_reset(struct fc_lport *lport, u32 sid, u32 did)
        unsigned int cpu;

        list_for_each_entry(ema, &lport->ema_list, ema_list) {
+               kref_get(&ema->mp->kref);

If the above contexts are not synchronized, we may be accessing an
invalid pointer through ema. Just before calling kref_get(), the ema may
already have been freed in fc_exch_mgr_del(). We still have no
protection against that, right? Am I missing something?

Are we not holding the lport lock when doing fc_exch_mgr_free(), to
guard the ema_list? I thought we were, but I might be wrong.

I thought that lock got grabbed somewhere in fcoe_destroy_work.
I do not see the lport lock being taken while calling
fc_exch_mgr_free(). Are you referring to fcoe_config_mutex? If so, that
will serialize exch mgr add/del operations, but not the ema_list
traversal in fc_exch_mgr_reset from the fc_rport_work context while the
ema delete is going on in another context.

I also don't see the lport lock being used here for serialization, and
taking it could cause circular locking around fc_exch_mgr_reset().


I think this condition should not have happened in the first place.
fc_fabric_logoff() should have waited for all the rport_work instances
to complete, so by the time fc_exch_mgr_free() is called the rport_work
should already have finished. Is there a different scenario that
triggers the issue?

It can happen as Neil described in the patch: the deferred rport work
can execute after the destroy thread has finished executing, and the
destroy thread frees up all resources, including the EMs and pools.

Vasu, I don't think that can happen.

The issue showed up with EM resources, but the rport worker could also
try to access a freed lport, so I think the fix should be to flush the
pending rport work before proceeding to free the EM or lport resources.

Current code already does this.

The rport work items are flushed in fc_fabric_logoff() by calling
lport->tt.rport_flush_queue(), which flushes rport_event_queue. This is
called before all resources, including the EMs/pools, are freed
(fc_exch_mgr_free()) in fcoe_if_destroy().

Thanks,
Bhanu



Vasu





_______________________________________________
devel mailing list
devel@open-fcoe.org
https://lists.open-fcoe.org/mailman/listinfo/devel
