Ramiro Alba Queipo wrote:
Hi all,

I recently had a problem with the card in the server of an InfiniBand
cluster, which in turn brought the whole fabric down, as the opensm
daemon had run into problems. Running dmesg, you could see:

--------------------------------------------------------------------
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0:   buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0:   buf[01]: 00000000
[408188.411271] ib_mthca 0000:0c:00.0:   buf[02]: 00000000
[408188.411274] ib_mthca 0000:0c:00.0:   buf[03]: 00000000
[408188.411276] ib_mthca 0000:0c:00.0:   buf[04]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0:   buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0:   buf[06]: ffffffff
[408188.411283] ib_mthca 0000:0c:00.0:   buf[07]: 00000000
[408188.411286] ib_mthca 0000:0c:00.0:   buf[08]: 00000000
[408188.411288] ib_mthca 0000:0c:00.0:   buf[09]: 00000000
[408188.411290] ib_mthca 0000:0c:00.0:   buf[0a]: 00000000
[408188.411292] ib_mthca 0000:0c:00.0:   buf[0b]: 00000000
[408188.411295] ib_mthca 0000:0c:00.0:   buf[0c]: 00000000
[408188.411297] ib_mthca 0000:0c:00.0:   buf[0d]: 00000000
[408188.411299] ib_mthca 0000:0c:00.0:   buf[0e]: 00000000
[408188.411302] ib_mthca 0000:0c:00.0:   buf[0f]: 00000000
------------------------------------------------------------
The problem was solved once I restarted networking. I mean:


Is this a hardware problem? Is there a way to check for a hardware
problem?
It could be a HW problem. I am forwarding this mail to our support people.
You can also submit a request on our support website:
http://www.mellanox.com/support/support_signup.php
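
For anyone who wants to catch this condition early, here is a rough sketch
of a script that scans kernel log text for the ib_mthca message quoted
above and collects the buf[] words that follow it. The regular expressions
are derived only from the log lines in this thread, and the function name
find_catastrophic_errors and the SAMPLE text are made up for illustration.
It only reports that the driver logged the error; it does not decode the
buffer and cannot tell whether the hardware is actually faulty.

```python
import re

# Patterns derived only from the dmesg lines quoted in this thread
# (assumption: other driver versions may format these lines differently).
ERROR_RE = re.compile(r"ib_mthca (\S+): Catastrophic error detected:\s*(.*)")
BUF_RE = re.compile(r"ib_mthca (\S+):\s+buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})")


def find_catastrophic_errors(log_text):
    """Return one dict per 'Catastrophic error detected' line in log_text.

    Each dict holds the PCI device string, the reason text found on the
    same line (may be empty if the line was wrapped), and the buf[] words
    logged for that device after the error line.
    """
    events = []
    latest = {}  # PCI device -> most recent event for that device
    for line in log_text.splitlines():
        m = ERROR_RE.search(line)
        if m:
            event = {"device": m.group(1), "reason": m.group(2).strip(), "buf": {}}
            events.append(event)
            latest[m.group(1)] = event
            continue
        m = BUF_RE.search(line)
        if m and m.group(1) in latest:
            # Store index and value as integers parsed from hex.
            latest[m.group(1)]["buf"][int(m.group(2), 16)] = int(m.group(3), 16)
    return events


# Hypothetical sample: a few lines copied from the log above.
SAMPLE = """\
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0:   buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0:   buf[01]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0:   buf[05]: 00127e9c
"""

for ev in find_catastrophic_errors(SAMPLE):
    # Show only the non-zero words, which are the ones worth passing on.
    nonzero = {f"{i:02x}": f"{v:08x}" for i, v in ev["buf"].items() if v}
    print(ev["device"], ev["reason"], nonzero)
```

The non-zero buf[] words are probably the most useful thing to include
when sending a report to support.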

Tziporet

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
