On Mon, 2011-10-24 at 14:36 -0700, Bhanu Prakash Gollapudi wrote:
> On 10/24/2011 1:42 PM, Neil Horman wrote:
> > On Mon, Oct 24, 2011 at 11:59:52AM -0700, Zou, Yi wrote:
> >>
> >>>
> >>> On 10/21/2011 4:36 PM, Neil Horman wrote:
> >>>> This oops was reported to me recently:
> >>>> PID: 5176  TASK: ffff880215274100  CPU: 0  COMMAND: "fc_rport_eq"
> >>>>  0 [ffff880218c65760] machine_kexec at ffffffff81031d3b
> >>>>  1 [ffff880218c657c0] crash_kexec at ffffffff810b8e92
> >>>>  2 [ffff880218c65890] oops_end at ffffffff814ef890
> >>>>  3 [ffff880218c658c0] no_context at ffffffff8104226b
> >>>>  4 [ffff880218c65910] __bad_area_nosemaphore at ffffffff810424f5
> >>>>  5 [ffff880218c65960] bad_area_nosemaphore at ffffffff810425c3
> >>>>  6 [ffff880218c65970] __do_page_fault at ffffffff81042c9d
> >>>>  7 [ffff880218c65a90] do_page_fault at ffffffff814f186e
> >>>>  8 [ffff880218c65ac0] page_fault at ffffffff814eec25
> >>>>  9 [ffff880218c65bb8] fc_fcp_complete_locked at ffffffffa02ed739 [libfc]
> >>>> 10 [ffff880218c65c08] fc_fcp_retry_cmd at ffffffffa02ed86f [libfc]
> >>>> 11 [ffff880218c65c28] fc_fcp_recv at ffffffffa02eed3f [libfc]
> >>>> 12 [ffff880218c65d28] fc_exch_mgr_reset at ffffffffa02e2373 [libfc]
> >>>> 13 [ffff880218c65db8] fc_rport_work at ffffffffa02e9f10 [libfc]
> >>>> 14 [ffff880218c65e38] worker_thread at ffffffff8108b250
> >>>> 15 [ffff880218c65ee8] kthread at ffffffff81090806
> >>>> 16 [ffff880218c65f48] kernel_thread at ffffffff8100c10a
> >>>>
> >>>> It results from two contexts that try to manipulate the same fcoe_exch_pool
> >>>> without synchronizing themselves:
> >>>>
> >>>> 1) The fcoe event_work workqueue, which calls:
> >>>>    fc_rport_work
> >>>>     fc_exch_mgr_reset
> >>>>      fc_exch_pool_reset
> >>>>
> >>>> 2) The FCoE transport destroy path, which schedules a destroy_work workqueue,
> >>>> calling:
> >>>>    fcoe_destroy_work
> >>>>     fcoe_if_destroy
> >>>>      fc_exch_mgr_free
> >>>>       fc_exch_mgr_del
> >>>>        fc_exch_mgr_destroy
> >>>>
> >>>> The pool_reset path
> >>>> holds the pool lock, but no reference to the pool manager
> >>>> kobject, while the exch_mgr_destroy path drops what is ostensibly the last
> >>>> reference to the pool manager kobject (causing it to be freed) while not
> >>>> holding the pool lock.
> >>>>
> >>>> The attached patch has been confirmed to prevent the panic.
> >>>>
> >>>> Signed-off-by: Neil Horman <nhor...@tuxdriver.com>
> >>>> CC: Robert Love <robert.w.l...@intel.com>
> >>>> ---
> >>>>  drivers/scsi/libfc/fc_exch.c |    2 ++
> >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
> >>>>
> >>>> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> >>>> index 8325489..d7c8100 100644
> >>>> --- a/drivers/scsi/libfc/fc_exch.c
> >>>> +++ b/drivers/scsi/libfc/fc_exch.c
> >>>> @@ -1815,10 +1815,12 @@ void fc_exch_mgr_reset(struct fc_lport *lport, u32 sid, u32 did)
> >>>>  	unsigned int cpu;
> >>>>
> >>>>  	list_for_each_entry(ema, &lport->ema_list, ema_list) {
> >>>> +		kref_get(&ema->mp->kref);
> >>>
> >>> If the above contexts are not synchronized, we may be accessing an
> >>> illegal pointer in ema. Just before calling kref_get(), the ema may
> >>> have been freed in fc_exch_mgr_del(). We still have no protection for
> >>> it, right? Am I missing something?
> >>>
> >> Are we not holding the lport lock when doing fc_exch_mgr_free() to guard
> >> the ema_list? I thought we were, but I might be wrong.
> >>
> > I thought that lock got grabbed somewhere in fcoe_destroy_work.
>
> I do not see the lport lock being taken while calling fc_exch_mgr_free().
> Are you referring to fcoe_config_mutex? If so, that will serialize exch
> mgr add/del operations, not the ema_list traversal in fc_exch_mgr_reset
> from the fc_rport_work context while the ema delete is going on in
> another context.
I also don't see the lport lock used here for serialization, and taking it
here could cause circular locking around fc_exch_mgr_reset().

> I think this condition should not have happened in the first place.
> fc_fabric_logoff() should have waited for all the rport_works to
> complete. So, when fc_exch_mgr_free() is called the rport_work should
> have completed. Is there a different scenario to see the issue?

It can happen as Neil described in the patch, due to the deferred rport
worker executing after the destroy thread has finished executing, which
also frees up all resources, including the EM/pools. The issue showed up
with EM resources, but the rport worker can also try to access a freed
lport, so I think the fix should be to flush the pending rport worker
thread before proceeding with freeing the EM or lport resources.

	Vasu

_______________________________________________
devel mailing list
devel@open-fcoe.org
https://lists.open-fcoe.org/mailman/listinfo/devel