On 01:45 Sat 05 Apr, Bernd Schubert wrote:
> 
> Hmm, I first increased head_of_queue_lifetime to 0x13 and 
> leaf_head_of_queue_lifetime to 0x20, but this didn't make the error 
> go away. So I increased head_of_queue_lifetime to 0x15 and 
> leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
> entirely.

Are you using the default (min hops) routing? I think it could be a
deadlock caused by the unlimited head_of_queue_lifetime values.
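
To make "unlimited" concrete, here is a rough sketch of how a lifetime
code maps to an actual time. It assumes the formula given in the opensm
option description (4.096 usec * 2^code) and that codes of 0x14 and
above disable the timer; the helper name hoq_lifetime_usec is made up
for illustration and is not part of opensm.

```python
import math

# Assumption: codes at or above this value switch the timeout off.
HOQ_DISABLE_CODE = 0x14


def hoq_lifetime_usec(code):
    """Return the head-of-queue lifetime in microseconds for a code.

    Returns math.inf when the code disables the timeout.
    """
    if code < 0:
        raise ValueError("lifetime code must be non-negative")
    if code >= HOQ_DISABLE_CODE:
        return math.inf
    return 4.096 * (2 ** code)


# Values mentioned in this thread, plus 0x12 for comparison.
for code in (0x12, 0x13, 0x15, 0x20, 0x50):
    print(hex(code), hoq_lifetime_usec(code))
```

Under these assumptions 0x13 is still a finite timeout (about 2.1
seconds), while 0x15 and 0x50 never drop a packet stuck at the head of
a queue, which is what would allow a routing loop to deadlock.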

> On the node of the master opensm I got an endless number of messages
> like these:
> 
> Apr  5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
> Apr  5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
> Apr  5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
> Apr  5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out
> 
> The slave opensm also went into D-state and is not killable anymore :(

Interesting... Any more details about this?

Sasha
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
