Hello Sasha,

On Sun, Apr 06, 2008 at 06:53:14AM +0000, Sasha Khapyorsky wrote:
> On 01:45 Sat 05 Apr , Bernd Schubert wrote:
> >
> > Hmm, I first increased head_of_queue_lifetime to 0x13 and
> > leaf_head_of_queue_lifetime to 0x20, but this didn't make the error
> > go away. So I increased head_of_queue_lifetime to 0x15 and
> > leaf_head_of_queue_lifetime to 0x50, but this made the fabric to entirely
> > crash.
>
> Are you using default (min hops) routing? I think it could be deadlock
> due to unlimited head_of_queue_lifetime values.
>
> > On the node of the master opensm I got an endless number of messages
> > like these:
> >
> > Apr 5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0:
> > transmit timed out
> > Apr 5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout:
> > latency 411908 msecs
> > Apr 5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1,
> > tx_head 441, tx_tail 377
> > Apr 5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0:
> > transmit timed out
> >
> > The slave opensm also went into D-state and is not killable anymore :(
>
> Interesting... Any more details about this?

Unfortunately not. As you can see, it was already rather late and I just
wanted to get the entire system working again, so I rebooted both nodes
running the opensms :(

Thanks,
Bernd
