On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>>  > I guess I came to a premature conclusion. One set of tests ran fine 
> >>> and I made that
> >>>  > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this.  Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> > 
> > I see other manifestations resulting in different crashes :
> > 
> > :mon> t
> > [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
> > [ib_ipoib]
> > [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> > [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> > [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> > [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >     sp: c00000074603b9a0
> >    msr: 8000000000009032
> >    dar: 3a0
> >  dsisr: 40000000
> >   current = 0xc000000756ce8b00
> >   paca    = 0xc000000000f63800
> >     pid   = 18095, comm = ipoib
> > 9:mon>
> 
> Recreating the crash has been tricky. I have tried several several hundred 
> times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I 
> took
> a closer look at the IPoIB CM code and I see a few things that look 
> suspicious.
> 
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is 
> held before 
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
> (fix dangling pointer).
> Is this a potential race?

ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
by netif_tx_lock(dev) or stopping the ipoib network device.
It all depends on what pointer or data structure you think is being
accessed while free or being modified without the proper protection.

> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
> &tx->flags) in 
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
> patch) there is a 
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling 
> ipoib_cm_destroy_tx() only in 
> select cases. Was that intended?

The v4 patch comments explain the changes:
http://www.spinics.net/lists/linux-rdma/msg03733.html
Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx
has completed the RC QP creation process via the CM instead of simply
when ipoib_cm_create_tx() allocates the structure.
The test and clear was used to indicate the struct ipoib_cm_tx
had been put on the destroy list and the reaper thread woken up.
Now ipoib_cm_destroy_tx() uses the tx->neigh pointer != NULL to
indicate that ipoib_cm_destroy_tx() has started the destroy process.
ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv->lock
are held to protect tx->neigh.

> Thanks
> Pradeep

The longer write up on locking is turning out to be very complex.
I will keep working on it but I think it will be just as hard
to understand as slogging through the code.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to