I was smoke testing a small cluster when one of the nodes posted this:
 
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May  7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff

At this point, all further IB traffic on that node failed, and the node
silently hung during shutdown.
 
Any suggestions as to what I should look at?
 
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
 