Ramiro Alba Queipo wrote:
Hi all,
I recently had a problem with the server card of an InfiniBand cluster,
which in turn brought the whole fabric down because the opensm daemon
ran into problems. Running dmesg showed:
--------------------------------------------------------------------
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0: buf[01]: 00000000
[408188.411271] ib_mthca 0000:0c:00.0: buf[02]: 00000000
[408188.411274] ib_mthca 0000:0c:00.0: buf[03]: 00000000
[408188.411276] ib_mthca 0000:0c:00.0: buf[04]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0: buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0: buf[06]: ffffffff
[408188.411283] ib_mthca 0000:0c:00.0: buf[07]: 00000000
[408188.411286] ib_mthca 0000:0c:00.0: buf[08]: 00000000
[408188.411288] ib_mthca 0000:0c:00.0: buf[09]: 00000000
[408188.411290] ib_mthca 0000:0c:00.0: buf[0a]: 00000000
[408188.411292] ib_mthca 0000:0c:00.0: buf[0b]: 00000000
[408188.411295] ib_mthca 0000:0c:00.0: buf[0c]: 00000000
[408188.411297] ib_mthca 0000:0c:00.0: buf[0d]: 00000000
[408188.411299] ib_mthca 0000:0c:00.0: buf[0e]: 00000000
[408188.411302] ib_mthca 0000:0c:00.0: buf[0f]: 00000000
------------------------------------------------------------
The problem was solved once I restarted networking. I mean:
Is this a hardware problem? Is there a way to check for a hardware
problem?
It could be a HW problem. I am forwarding this mail to our support people.
You can also submit a request on our support website:
http://www.mellanox.com/support/support_signup.php
Tziporet
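
For anyone who wants to watch for this condition while waiting on support, here is a minimal sketch in Python that scans kernel log text for the ib_mthca report quoted above. It only detects that the driver logged the error and collects the buf[] words; it does not decode them and cannot by itself tell a hardware fault from a firmware or driver one. The function name and the returned dictionary layout are hypothetical, chosen for illustration, and the patterns are written against the exact message format shown in this thread.

```python
import re

# Header line as it appears in the dmesg output quoted in this thread.
HEADER = re.compile(
    r"ib_mthca (?P<dev>\S+): Catastrophic error detected: (?P<kind>.+)"
)
# Dump lines that follow the header, e.g. "buf[05]: 00127e9c".
BUF = re.compile(
    r"ib_mthca (?P<dev>\S+): buf\[(?P<idx>[0-9a-f]{2})\]: (?P<val>[0-9a-f]{8})"
)


def find_catastrophic_errors(log_text):
    """Return one dict per catastrophic-error report found in log_text.

    Hypothetical helper: each dict holds the PCI device string, the error
    kind reported by the driver, and the raw buf[] words keyed by index.
    """
    reports = []
    for line in log_text.splitlines():
        m = HEADER.search(line)
        if m:
            # A new report starts at every header line.
            reports.append(
                {"device": m["dev"], "kind": m["kind"].strip(), "buf": {}}
            )
            continue
        m = BUF.search(line)
        # Attach dump words to the most recent report for the same device.
        if m and reports and reports[-1]["device"] == m["dev"]:
            reports[-1]["buf"][int(m["idx"], 16)] = int(m["val"], 16)
    return reports
```

To use it, pass in the text of the kernel log, for example the output of dmesg captured with subprocess.run(["dmesg"], capture_output=True, text=True).stdout. An empty list means the driver logged no such report in that text. The buf[] words are kept raw so they can be pasted into a support request.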