On 04/04/2013 06:49 PM, Roland Dreier wrote:
> 
> I don't know so much about this PCI error recovery stuff but it does
> seem sensible to trigger a catastrophic error async event when it
> happens (I'm assuming the recovery mechanism resets the adapter).

The PCI error recovery in the powerpc architecture, which is where I'm
focusing, works by identifying a misbehaving adapter and freezing its
slot, so that all MMIO writes to that device will be ignored and reads
will return all 1's. When that happens, the Linux implementation
invokes callbacks in the driver (in this case mlx4_core) so that it can
recover from the error, and resets the slot. The most common procedure
is for the driver to remove the adapter and add it back, which is what
mlx4_ib is trying to do.
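
To make the frozen-slot behaviour concrete, here is a small userspace
model of it. This is only a sketch, not kernel or mlx4 code: the names
slot_model, model_write32, model_read32 and model_recover are
hypothetical and exist only for this illustration. It assumes that
recovery amounts to a reset which clears the device registers and
unfreezes the slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NREGS 4

/* Hypothetical model of a device sitting in a slot that can be frozen. */
struct slot_model {
	bool frozen;
	uint32_t regs[MODEL_NREGS];
};

/* While the slot is frozen, MMIO writes are silently dropped. */
static void model_write32(struct slot_model *s, size_t idx, uint32_t val)
{
	if (s->frozen || idx >= MODEL_NREGS)
		return;
	s->regs[idx] = val;
}

/*
 * While the slot is frozen, MMIO reads return all 1's. The bounds
 * check is only a guard for the model, not part of the real behaviour.
 */
static uint32_t model_read32(const struct slot_model *s, size_t idx)
{
	if (s->frozen || idx >= MODEL_NREGS)
		return UINT32_MAX;
	return s->regs[idx];
}

/*
 * Assumed recovery step: the slot is reset, so the register state is
 * lost, and the slot is unfrozen. The driver has to reinitialize the
 * device afterwards, which is why remove + add is the usual approach.
 */
static void model_recover(struct slot_model *s)
{
	size_t i;

	for (i = 0; i < MODEL_NREGS; i++)
		s->regs[i] = 0;
	s->frozen = false;
}
```

In this model a driver that reads 0xffffffff from a register that
should never hold that value can suspect a frozen slot, and nothing it
writes reaches the device until after recovery.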

> 
> Then we should fix at least kernel ULPs behave appropriately when they
> get such an async event.  And similarly if someone wants to harden
> some subset of userspace apps to handle PCI error recovery too, that
> would be another step forward.
> 

I agree, this seems to be what is missing to have the error recovery
fully functional.


Thanks,

-- 
Kleber Sacilotto de Souza
IBM Linux Technology Center
