I will write up a description of the locking as I understand it and the changes I made. Give me a day or two to write it up and check it.
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote: > Pradeep Satyanarayana wrote: > > Pradeep Satyanarayana wrote: > >> Roland Dreier wrote: > >>> > I guess I came to a premature conclusion. One set of tests ran fine > >>> and I made that > >>> > conclusion. Another set of tests caused the following crash: > >>> > >>> I don't really know how to interpret this. Is this crash new, or is it > >>> the same crash you were hoping this patch fixed? > >> This is a new crash. > > > > I see other manifestations resulting in different crashes : > > > > :mon> t > > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib] > > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 > > [ib_ipoib] > > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0 > > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180 > > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0 > > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70 > > 9:mon> e > > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720] > > pc: c0000000005ac390: ._spin_lock+0x20/0xc8 > > lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib] > > sp: c00000074603b9a0 > > msr: 8000000000009032 > > dar: 3a0 > > dsisr: 40000000 > > current = 0xc000000756ce8b00 > > paca = 0xc000000000f63800 > > pid = 18095, comm = ipoib > > 9:mon> > > Recreating the crash has been tricky. I have tried several several hundred > times today > to unload and reload IPoIB while there is traffic and no crashes happened. I > took > a closer look at the IPoIB CM code and I see a few things that look > suspicious. > > In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is > held before > calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch > (fix dangling pointer). > Is this a potential race? > > In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, > &tx->flags) in > ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's > patch) there is a > clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling > ipoib_cm_destroy_tx() only in > select cases. Was that intended? > > Thanks > Pradeep > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
