I was smoke testing a small cluster when one of the nodes posted this:

May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff
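
For anyone comparing dumps like this across several nodes, here is a minimal Python sketch that pulls the buf[] words out of syslog text and checks whether every word is ffffffff. The names summarize_dump and all_ones are hypothetical helpers, and the regular expression assumes the exact line format shown above. An all-ones pattern is commonly what a read returns when a PCI device has stopped responding, but treat that as a hint to investigate, not a diagnosis.

```python
import re

# Matches lines such as "... mlx4_core 0000:02:00.0: buf[0a]: ffffffff".
# Assumes the syslog format shown in the excerpt above.
BUF_RE = re.compile(r"mlx4_core (\S+): buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})")


def summarize_dump(log_text):
    """Return {pci_address: {buffer_index: word}} for every buf[] line found."""
    dumps = {}
    for addr, idx, word in BUF_RE.findall(log_text):
        dumps.setdefault(addr, {})[int(idx, 16)] = int(word, 16)
    return dumps


def all_ones(words):
    """True when at least one word was dumped and every word is 0xffffffff."""
    return bool(words) and all(w == 0xFFFFFFFF for w in words.values())
```
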

At this point, all further IB traffic on that node failed, and the node silently hung during shutdown. Any suggestions as to what I should look at?

-- 
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
