copying list on response
Okay - there is apparently an issue with winmad handling device removal (power
exit) while there is an active user. (Everything in the stack has this sort of
issue, btw.) I will need to look at the device removal code to see what the
issue may be.
Winmad does the following during device removal:
void WmRegRemoveHandler(WM_REGISTRATION *pRegistration)
{
    ib_port_attr_mod_t port_cap;

    if (pRegistration->pDevice == NULL) {
        return;
    }

    if (pRegistration->PortCapMask) {
        RtlZeroMemory(&port_cap.cap, sizeof(port_cap.cap));
        pRegistration->pDevice->IbInterface.modify_ca(pRegistration->hCa,
                                                      pRegistration->PortNum,
                                                      pRegistration->PortCapMask,
                                                      &port_cap);
    }

    WmProviderDeregister(pRegistration->pProvider, pRegistration);
    pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp,
                                                   NULL);
    pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd,
                                                   NULL);
    pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
    pRegistration->pDevice->IbInterface.close_al(pRegistration->hIbal);
    WmIbDevicePut(pRegistration->pDevice);
    pRegistration->pDevice = NULL;
}
The expectation was that after all of these calls return, no callbacks are in
progress and no further callbacks will occur.
Can you try replacing the NULL parameters above in WmRegRemoveHandler with
'ib_sync_destroy'?
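
To illustrate why the destroy-callback argument might matter, here is a toy
user-mode model, not winmad or IBAL code. Every mock_* name is hypothetical.
It assumes that a NULL destroy callback lets the destroy call return while
receive callbacks are still queued, and that the sync sentinel makes the call
drain them first. That is the hypothesis being tested, not a confirmed
diagnosis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real IBAL or winmad types. */
typedef void (*mock_destroy_cb_t)(void *context);

/* Sentinel: passing this requests a blocking destroy (models ib_sync_destroy). */
static void mock_sync_destroy(void *context) { (void)context; }

typedef struct mock_reg {
    int *device;          /* models pRegistration->pDevice */
    int  pending_recvs;   /* receive callbacks still queued */
    int  late_callbacks;  /* callbacks that ran after device was cleared */
} mock_reg_t;

/* Models WmReceiveHandler: it needs reg->device to hand the MAD back. */
static void mock_recv_handler(mock_reg_t *reg)
{
    if (reg->device == NULL) {
        reg->late_callbacks++;   /* the real handler would dereference NULL here */
        return;
    }
    (*reg->device)++;            /* stands in for put_mad() through the device */
}

/* Runs every queued receive callback. */
static void mock_run_pending(mock_reg_t *reg)
{
    while (reg->pending_recvs > 0) {
        reg->pending_recvs--;
        mock_recv_handler(reg);
    }
}

/* Models destroy_qp: drain first only when the sync sentinel is passed. */
static void mock_destroy_qp(mock_reg_t *reg, mock_destroy_cb_t cb)
{
    if (cb == mock_sync_destroy) {
        mock_run_pending(reg);
    }
}

/* Models WmRegRemoveHandler, then delivery of whatever is still queued.
 * Returns the number of callbacks that saw a NULL device. */
static int mock_remove(mock_reg_t *reg, mock_destroy_cb_t cb)
{
    assert(reg != NULL);
    mock_destroy_qp(reg, cb);
    reg->device = NULL;
    mock_run_pending(reg);
    return reg->late_callbacks;
}
```

In this model, mock_remove() with NULL reports one late callback per queued
receive, while mock_remove() with mock_sync_destroy reports none. Whether the
real stack behaves the same way is exactly what the experiment should show.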
Note that I'm not completely convinced that the locking used during device
removal is correct. But I would expect that to lead more to a deadlock
condition than a blue screen.
> -----Original Message-----
> From: Alex Naslednikov [mailto:[email protected]]
> Sent: Sunday, November 21, 2010 6:14 AM
> To: Hefty, Sean
> Subject: [ofw] BSOD at winmad
>
> Hello Sean,
>
> Recently we hit a BSOD in the winmad driver. I investigated the problem in
> more depth, and your comments are more than welcome.
>
>
>
> 1. Callstack:
>
> winmad!WmReceiveHandler+0x45
> [s:\builds\6872\trunk\core\winmad\kernel\wm_provider.c @ 378]
>
> ibbus!__mad_svc_recv_done+0x9a9 [s:\builds\6872\trunk\core\al\al_mad.c @ 2217]
>
> ibbus!mad_disp_recv_done+0x11c6 [s:\builds\6872\trunk\core\al\al_mad.c @ 1016]
>
> ibbus!process_mad_recv+0x2f2 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @
> 2976]
>
> ibbus!spl_qp_comp+0x2a1 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2806]
>
> ibbus!spl_qp_recv_dpc_cb+0xcb [s:\builds\6872\trunk\core\al\kernel\al_smi.c @
> 2674]
>
>
>
> 2. BSOD on null pointer:
>
> WdfObjectAcquireLock(prov->ReadQueue);
> if (reg->hService == NULL) {
>     reg->pDevice->IbInterface.put_mad(pMad);   <-- pDevice == NULL
>     goto unlock;
> }
>
> 3. There are only 2 places where pDevice is set to NULL: the init error
> flow (WmRegInit) and destroy (WmRegRemoveHandler).
>
> I suspect only the second case here; in our case it happened because
> WmPowerD0Exit() was called.
>
> That is, WmPowerD0Exit()->WmProviderRemoveHandler()->WmRegRemoveHandler()
>
> 4. On the other hand, WmReceiveHandler had still not been removed.
> Theoretically, this can be caused by:
>
> a. Not all WM callbacks were cleaned up
>
> b. Receiving of new MADs was stopped, but some MADs that were already
> being processed still ended up in WmReceiveHandler
>
>
> Alexander (XaleX) Naslednikov
>
> SW Networking Team
>
> Mellanox Technologies
>
>
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw