On 04/04/2013 06:49 PM, Roland Dreier wrote:
>
> I don't know so much about this PCI error recovery stuff but it does
> seem sensible to trigger a catastrophic error async event when it
> happens (I'm assuming the recovery mechanism resets the adapter).

The PCI error recovery in the powerpc architecture, which is where I'm
focusing, works by identifying a misbehaving adapter and freezing its
slot, so that all MMIO writes to that device are ignored and reads
return all 1's. When that happens, the Linux implementation invokes
some callbacks on the driver (in this case mlx4_core) to recover from
the error, and resets the slot. The most common procedure is for the
driver to remove the adapter and add it back, which is what mlx4_ib
is trying to do.

>
> Then we should fix at least kernel ULPs behave appropriately when they
> get such an async event. And similarly if someone wants to harden
> some subset of userspace apps to handle PCI error recovery too, that
> would be another step forward.
>

I agree, this seems to be what is missing to have the error recovery
fully functional.

Thanks,
--
Kleber Sacilotto de Souza
IBM Linux Technology Center
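
As a rough illustration of the recovery flow described above, here is a
minimal userspace sketch in C. It is not kernel code: struct slot,
struct err_handlers, freeze_slot(), recover_slot() and the drv_*
functions are all hypothetical names made up for this example, only
loosely modelled on the error_detected and slot_reset callbacks of the
kernel's struct pci_error_handlers. It assumes a single 32-bit register
and that the driver removes the adapter on error and adds it back after
the slot reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel data structure. */
struct slot {
    bool frozen;          /* set when the platform isolates the slot */
    bool adapter_present; /* whether the driver has the adapter registered */
    uint32_t reg;         /* a single device register reached via MMIO */
};

/* Hypothetical callbacks, loosely modelled on struct pci_error_handlers. */
struct err_handlers {
    void (*error_detected)(struct slot *s);
    void (*slot_reset)(struct slot *s);
};

/* While the slot is frozen, writes are silently dropped. */
static void mmio_write(struct slot *s, uint32_t val)
{
    if (!s->frozen)
        s->reg = val;
}

/* While the slot is frozen, reads return all 1's. */
static uint32_t mmio_read(const struct slot *s)
{
    return s->frozen ? UINT32_MAX : s->reg;
}

/* Platform side: isolate a misbehaving adapter. */
static void freeze_slot(struct slot *s)
{
    s->frozen = true;
}

/* Driver side (assumed behaviour): remove the adapter on error... */
static void drv_error_detected(struct slot *s)
{
    s->adapter_present = false;
}

/* ...and add it back once the slot has been reset. */
static void drv_slot_reset(struct slot *s)
{
    s->adapter_present = true;
}

static const struct err_handlers drv_handlers = {
    .error_detected = drv_error_detected,
    .slot_reset = drv_slot_reset,
};

/* Platform side: notify the driver, reset and unfreeze, notify again. */
static void recover_slot(struct slot *s, const struct err_handlers *h)
{
    h->error_detected(s);
    s->reg = 0;        /* the reset clears device state in this model */
    s->frozen = false;
    h->slot_reset(s);
}
```

Calling freeze_slot() and then mmio_read() yields 0xFFFFFFFF regardless
of what was written before, which is the symptom a driver can check for.
After recover_slot(&s, &drv_handlers) the slot is unfrozen and the
adapter is registered again.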
