For the last few days I have been getting random errors in MPICH communication on my cluster. A typical error message looks like this:

p2_2983: p4_error: socket_recv_on_fd: invalid data type %d

Since such messages have appeared only recently, quite randomly, and only after several hours of computation, I suspect a hardware problem. I ran some tests (such as prime95) on all the nodes but have found nothing so far. The most suspect components now are the switch or perhaps the cables.

Does anyone know of a tool that could put a heavy load on my internal cluster network and test whether communication is OK?

Pavel Jurus
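
For illustration only, here is a minimal sketch of the kind of check asked about above. It assumes plain TCP sockets rather than MPICH itself. It pushes random payloads through an echo server and counts the rounds that come back altered. The function names (recv_exact, serve_echo, count_corrupt_rounds), the payload size, and the round count are all hypothetical choices, not part of any existing tool. The demo runs over loopback. On a real cluster the echo side would run on one node and the client on another, with far more rounds. Because TCP retransmits damaged segments, a faulty switch or cable may show up as stalls or connection errors rather than as a nonzero count.

```python
import os
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def serve_echo(listener):
    """Accept one connection and echo everything back until it closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            conn.sendall(data)


def count_corrupt_rounds(host, port, rounds=50, size=1 << 14):
    """Send random payloads and count the rounds whose echo differs."""
    bad = 0
    with socket.create_connection((host, port)) as sock:
        for _ in range(rounds):
            payload = os.urandom(size)
            sock.sendall(payload)
            # The payload is kept small so it fits in the socket buffers
            # while the client is still sending and not yet reading.
            if recv_exact(sock, size) != payload:
                bad += 1
    return bad


if __name__ == "__main__":
    # Loopback demo: port 0 lets the OS pick a free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    threading.Thread(target=serve_echo, args=(listener,), daemon=True).start()
    print("corrupt rounds:", count_corrupt_rounds("127.0.0.1", port))
```

Running such a client between every pair of nodes for several hours, and logging which pairs fail, would help narrow the fault down to a particular port or cable.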

