From: Thadeu Lima de Souza Cascardo <[email protected]>
Date: Tue, 28 Feb 2012 17:34:38 -0300

> On Tue, Feb 28, 2012 at 02:30:51PM -0500, David Miller wrote:
>> From: Thadeu Lima de Souza Cascardo <[email protected]>
>> Date: Tue, 28 Feb 2012 15:36:16 -0300
>>
>> > When a EEH happens, the catas poll code will try to restart the device,
>> > removing it and adding it back again. The EEH code will try to do the
>> > same. One of the threads ends up accessing memory that was freed by the
>> > other thread and we get a crash.
>>
>> Stop adding bandaids to the locking.
>>
>> If the EEH infrastructure doesn't synchronize parallel operations
>> on the same device, that is the real bug, and that's where the real
>> fix belongs.
>>
>> I refuse to apply this patch.
>>
>
> It's not EEH that does not synchronize removal. The problem is that the
> driver itself calls the driver remove function through mlx4_restart_one.

Then reuse the existing intf_mutex this driver has, export it to main.c
and add a new __mlx4_unregister_device that can be called with the
intf_mutex held already.
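The locking pattern suggested above is a "locked" variant of the unregister function, so the restart path can hold one mutex across both the remove and the re-add without deadlocking on itself. Below is a minimal userspace sketch of that idea, not mlx4 driver code. Assumptions: a pthread mutex stands in for the kernel's intf_mutex, a single boolean stands in for the device's interface state, and every function name here (unregister_device, unregister_device_locked, register_device_locked, restart_one) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace analog of the proposal; not mlx4 driver code. */
static pthread_mutex_t intf_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool registered = true;       /* stand-in for the device's interface state */
static bool intf_mutex_held = false; /* coarse debug aid, written only under intf_mutex:
                                        catches calls made while no thread holds it */

/* Caller must already hold intf_mutex (the role the proposed
 * __mlx4_unregister_device would play). Returns true if it unregistered. */
static bool unregister_device_locked(void)
{
    assert(intf_mutex_held);
    if (!registered)
        return false; /* already torn down by another path */
    registered = false;
    return true;
}

/* Caller must already hold intf_mutex. */
static void register_device_locked(void)
{
    assert(intf_mutex_held);
    registered = true;
}

static void lock_intf(void)
{
    pthread_mutex_lock(&intf_mutex);
    intf_mutex_held = true;
}

static void unlock_intf(void)
{
    intf_mutex_held = false;
    pthread_mutex_unlock(&intf_mutex);
}

/* Public entry point (e.g. an error-recovery path): takes the lock itself. */
static bool unregister_device(void)
{
    lock_intf();
    bool done = unregister_device_locked();
    unlock_intf();
    return done;
}

/* Restart path: holds intf_mutex across remove and re-add and calls the
 * _locked variants, so it does not self-deadlock and no other thread can
 * run unregister_device() against the half-restarted state. */
static void restart_one(void)
{
    lock_intf();
    unregister_device_locked();
    register_device_locked();
    unlock_intf();
}
```

The kernel convention is a leading `__` for "caller holds the lock" helpers; in userspace C, identifiers beginning with a double underscore are reserved, so the sketch uses a `_locked` suffix instead. Either way, the public function keeps taking the lock itself, and only code that already holds intf_mutex calls the variant.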
