[EMAIL PROTECTED] wrote:
> On Thu, Nov 06, 2008 at 05:12:50PM +0200, Jack Morgenstein wrote:
>> On Thursday 06 November 2008 03:23, [EMAIL PROTECTED] wrote:
>>> I described an IPoIB-related panic we were seeing on large
>>> clusters. The signature was a backtrace like this:
>>>
>>> skb_over_panic
>>> :ib_ipoib:ipoib_ib_handle_rx_wc
>>> :ib_ipoib:ipoib_poll
>>> net_rx_action
>>> .....
>>>
>>> The bug is difficult to reproduce, but we finally got a crashdump,
>>> and the problem appears to be that stale skb pointers on the tx_ring
>>> were left pointing to skbs that had been since reused, so that the
>>> skb's data region was now unexpectedly short, etc.
>>>
>> How does ipoib_ib_handle_rx_wc() involve the tx_ring? This is
>> receive processing.
>>
>
> What I surmise may be happening is something like this:
>
> - tx skb is freed, but a stale pointer remains on tx_ring
> - the same skb is reallocated, and added to the rx_ring
> - now we get an 'unexpected' tx completion, and use the stale
>   skb pointer on the tx_ring to again free the skb (this step
>   seems to invoke a f/w bug)
> - another driver, say an ethernet driver, reallocates the skb,
>   reducing the extent of the data region (leading to the
>   skb_over_panic once it's processed by ipoib_ib_handle_rx_wc)
>
>
> This bug leaves the tx and rx rings corrupted in many ways,
> including:
>
> - different rx_ring members refer to the same skb
> - different skbs on the rx_ring have identical data, head, end, tail ptrs
> - skbs on the rx_ring have sizes inconsistent with what the ipoib
>   driver allocates (which causes the skb_over_panic, of course)
> - rx skbs have 'dev' pointers to ethernet devices
> - dma mappings in rx_ring aren't consistent with what's in skb
> - some skbs are simultaneously on the tx and rx rings
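
The stale-pointer sequence quoted above can be modelled with a small userspace sketch. This is not the ipoib driver code: `toy_buf`, `toy_ring`, `toy_post`, `toy_complete` and `RING_SIZE` are hypothetical names made up for illustration, and `malloc`/`free` stand in for skb allocation. It only shows the general idea that clearing the ring slot when the buffer is freed lets a duplicate or unexpected completion be detected instead of freeing the same buffer twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4              /* hypothetical ring size for the toy model */

struct toy_buf {                 /* stand-in for struct sk_buff */
    int len;
};

struct toy_ring {                /* stand-in for a tx ring of buffer pointers */
    struct toy_buf *slot[RING_SIZE];
    unsigned int frees;          /* buffers actually released */
    unsigned int stale_hits;     /* completions that found an empty slot */
};

/* Post a send: allocate a buffer and record ownership in the ring slot. */
int toy_post(struct toy_ring *r, unsigned int wr_id, int len)
{
    unsigned int i = wr_id % RING_SIZE;
    struct toy_buf *b;

    assert(r != NULL);
    if (r->slot[i] != NULL)
        return -1;               /* slot still owned by an earlier send */

    b = malloc(sizeof(*b));
    if (b == NULL)
        return -1;
    b->len = len;
    r->slot[i] = b;
    return 0;
}

/*
 * Handle a send completion. The slot is cleared before the buffer is
 * freed, so a second completion for the same wr_id finds NULL and is
 * counted as stale rather than freeing the buffer again.
 */
int toy_complete(struct toy_ring *r, unsigned int wr_id)
{
    unsigned int i = wr_id % RING_SIZE;
    struct toy_buf *b;

    assert(r != NULL);
    b = r->slot[i];
    if (b == NULL) {
        r->stale_hits++;         /* unexpected completion: nothing to free */
        return -1;
    }
    r->slot[i] = NULL;           /* drop the pointer before releasing it */
    free(b);
    r->frees++;
    return 0;
}
```

With a zero-initialised `struct toy_ring`, calling `toy_post(&r, 0, 64)` and then `toy_complete(&r, 0)` twice frees the buffer once; the second call returns -1 and increments `stale_hits`. If the slot were left pointing at the freed buffer, the second completion would free it again, which is the first step of the corruption described above.
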
If I am not mistaken, we saw a problem with similar characteristics more
than two years ago on IBM platforms: the same pattern of the rx_ring
reusing tx_ring skbs, and it showed up only under stress. This was in UD
mode (before CM came into the picture), and it turned out to be a driver
issue. Could that be the same here?

Pradeep
