Sean,
I don't think that ib_sync_destroy() will help. Please see the scenario below 
(and correct me if I'm wrong).

1. WmRegRemoveHandler sets pRegistration->pDevice = NULL;
2. WmReceiveHandler() uses pReg->pDevice 
3. That callback is registered in WmRegInit():


        svc.mad_svc_context = pRegistration;
        svc.pfn_mad_send_cb = WmSendHandler;
        svc.pfn_mad_recv_cb = WmReceiveHandler;
        svc.support_unsol = WmConvertMethods(&svc, pAttributes);
        svc.mgmt_class = pAttributes->Class;
        svc.mgmt_version = pAttributes->Version;
        svc.svc_type = IB_MAD_SVC_DEFAULT;

        ib_status = dev->IbInterface.reg_mad_svc(pRegistration->hQp, &svc,
                                                 &pRegistration->hService);

4. How can we ensure that this callback has been removed before we clear the 
pDevice pointer?
I.e., I am looking for something like a call to dereg_mad_svc.

5. Otherwise, the callback can still fire after we have cleared the device pointer.
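
To make the ordering in points 4 and 5 concrete, here is a minimal user-mode sketch. It is not winmad code: every mock_* name is hypothetical, the structures are stand-ins for the real ones, and clearing recv_cb only stands in for a dereg_mad_svc-style call whose real signature I have not checked. It is single-threaded, so it shows only the ordering (deregister first, then clear pDevice), not the handling of callbacks that are already running.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the winmad structures; not the real definitions. */
typedef struct mock_device {
    int put_mad_calls;                    /* counts simulated put_mad() calls */
} mock_device;

typedef struct mock_registration {
    mock_device *pDevice;                 /* cleared on removal */
    void (*recv_cb)(struct mock_registration *reg); /* NULL once deregistered */
} mock_registration;

/* Receive callback: dereferences pDevice, as WmReceiveHandler does. */
static void mock_receive_handler(mock_registration *reg)
{
    assert(reg->pDevice != NULL);         /* this is where the driver would crash */
    reg->pDevice->put_mad_calls++;
}

/* Simulated MAD dispatcher: delivers only while a callback is registered.
 * Returns 1 if the callback ran, 0 if nothing was registered. */
static int mock_deliver_mad(mock_registration *reg)
{
    if (reg->recv_cb == NULL)
        return 0;
    reg->recv_cb(reg);
    return 1;
}

/* Removal in the order point 4 asks for: deregister first, then clear. */
static void mock_remove(mock_registration *reg)
{
    reg->recv_cb = NULL;                  /* stands in for deregistering the MAD service */
    reg->pDevice = NULL;                  /* safe only because no callback can fire now */
}
```

If mock_remove() cleared pDevice without first clearing recv_cb, a later mock_deliver_mad() would reach the assert in mock_receive_handler(), which is the user-mode analogue of the NULL dereference in the BSOD.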

-----Original Message-----
From: Hefty, Sean [mailto:[email protected]] 
Sent: Monday, November 22, 2010 7:39 PM
To: Alex Naslednikov
Cc: [email protected]
Subject: RE: [ofw] BSOD at winmad

copying list on response

Okay - there is apparently an issue with winmad handling device removal (power 
exit) while there is an active user.  (Everything in the stack has this sort of 
issue, btw.)  I will need to look at the device removal code to see what the 
issue may be.

Winmad does the following during device removal:

void WmRegRemoveHandler(WM_REGISTRATION *pRegistration)
{
        ib_port_attr_mod_t      port_cap;

        if (pRegistration->pDevice == NULL) {
                return;
        }

        if (pRegistration->PortCapMask) {
                RtlZeroMemory(&port_cap.cap, sizeof(port_cap.cap));
                pRegistration->pDevice->IbInterface.modify_ca(pRegistration->hCa,
                                                              pRegistration->PortNum,
                                                              pRegistration->PortCapMask,
                                                              &port_cap);
        }

        WmProviderDeregister(pRegistration->pProvider, pRegistration);
        pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
        pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
        pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
        pRegistration->pDevice->IbInterface.close_al(pRegistration->hIbal);

        WmIbDevicePut(pRegistration->pDevice);
        pRegistration->pDevice = NULL;
}

The expectation was that after all of these calls return, no callbacks are in 
progress and no further callbacks will occur.
 
Can you try replacing the NULL parameters above in WmRegRemoveHandler with 
'ib_sync_destroy'?

Note that I'm not completely convinced that the locking used during device 
removal is correct.  But I would expect that to lead more to a deadlock 
condition than a blue screen.
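
The expectation above (no callbacks in progress, and none started later, once the teardown calls return) can be stated as a small invariant. The sketch below is a hypothetical user-mode illustration only: the mock_* names are invented, it is single-threaded, and it does not show how ib_sync_destroy or the IBAL destroy paths are actually implemented. A real driver would need interlocked operations or a lock around these fields, plus an event to wait on instead of returning early.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registration state; not the real WM_REGISTRATION. */
typedef struct mock_reg {
    void *pDevice;      /* must stay valid while any callback is running */
    int   in_flight;    /* callbacks currently executing */
    bool  closing;      /* set once removal has started */
} mock_reg;

/* Callback entry: refuse new work once removal has started. */
static bool mock_cb_enter(mock_reg *reg)
{
    if (reg->closing)
        return false;
    reg->in_flight++;
    return true;
}

/* Callback exit: the callback no longer touches pDevice after this. */
static void mock_cb_exit(mock_reg *reg)
{
    reg->in_flight--;
}

/* Removal: block new callbacks, and clear pDevice only when none are running.
 * Returns true when teardown completed, false if callbacks are still active. */
static bool mock_try_remove(mock_reg *reg)
{
    reg->closing = true;
    if (reg->in_flight > 0)
        return false;   /* a synchronous destroy would wait here instead */
    reg->pDevice = NULL;
    return true;
}
```

A caller would bracket the body of each callback with mock_cb_enter() and mock_cb_exit(), and retry or wait until mock_try_remove() returns true before releasing the device.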

> -----Original Message-----
> From: Alex Naslednikov [mailto:[email protected]]
> Sent: Sunday, November 21, 2010 6:14 AM
> To: Hefty, Sean
> Subject: [ofw] BSOD at winmad
> 
> Hello Sean,
> 
> Recently we hit a BSOD in the winmad driver. I investigated the problem in more 
> depth, and your comments are more than welcome.
> 
> 
> 
> 1.       Callstack:
> 
> winmad!WmReceiveHandler+0x45 [s:\builds\6872\trunk\core\winmad\kernel\wm_provider.c @ 378]
> ibbus!__mad_svc_recv_done+0x9a9 [s:\builds\6872\trunk\core\al\al_mad.c @ 2217]
> ibbus!mad_disp_recv_done+0x11c6 [s:\builds\6872\trunk\core\al\al_mad.c @ 1016]
> ibbus!process_mad_recv+0x2f2 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2976]
> ibbus!spl_qp_comp+0x2a1 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2806]
> ibbus!spl_qp_recv_dpc_cb+0xcb [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2674]
> 
> 
> 
> 2.       BSOD on null pointer:
> 
>         WdfObjectAcquireLock(prov->ReadQueue);
>         if (reg->hService == NULL) {
>                 reg->pDevice->IbInterface.put_mad(pMad);  <-- pDevice == NULL
>                 goto unlock;
>         }
> 
> 
> 3.       There are only 2 places where pDevice is set to NULL: the init error 
> flow (WmRegInit) and destroy (WmRegRemoveHandler).
> 
> I suspect only the second case here, and in our case it happened because 
> WmPowerD0Exit() was called.
> 
> That is, WmPowerD0Exit()->WmProviderRemoveHandler()->WmRegRemoveHandler()
> 
> 4.       On the other hand, WmReceiveHandler had still not been removed. 
> Theoretically, this can be caused by:
> 
> a.       Not all WM callbacks were cleaned up
> 
> b.      Reception of new MADs was stopped, but some MADs that were already being 
> processed still ended up in WmReceiveHandler
> 
> 
> 
> 
> 
> 
> 
> Alexander (XaleX) Naslednikov
> 
> SW Networking Team
> 
> Mellanox Technologies
> 
> 

_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
