Hello,

I have a small 8-node PC cluster here, and we are getting MPICH error messages that look like this one:

p7_13798: p4_error: socket_recv_on_fd: invalid data type %d : 6

The error occurs seemingly at random, once every 10-40 hours of computation time. The same software runs well on another cluster, so I suspected hardware first. We tried swapping out some nodes (the error seems to occur randomly on all nodes) and even the master computer, with no success. The last suspicious hardware components are the cables and the hub, but I'm not sure how to test those for such a random error.

I managed to google some old references to the same error here:

http://www.beowulf.org/pipermail/beowulf/2001-August/000957.html

which hint that the problem might be in MPICH itself. We use MPICH 1.2.2 here, which is one of the few packages we took directly from upstream rather than from Debian. If I remember correctly, the reason is that we had some problems getting Debian's mpich to work together with the PGI Fortran compiler (which is the one we have to use here).

I would be happy to hear any ideas about where the problem could be, what else to check, or whether someone else has already seen this error ...

Pavel
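
P.S. On the cables and hub: one check that might be worth trying is to watch the per-interface error counters that the Linux kernel exposes in /proc/net/dev on each node. Below is a minimal sketch in Python, not a tested diagnostic. It assumes the usual 16-column layout of that file (8 receive counters, then 8 transmit counters). The function names parse_net_dev and suspicious, and the choice of which counters to flag, are my own for illustration. Nonzero counters would only hint at a bad cable or port, not prove it, and collisions are left out because some are expected on a shared hub.

```python
import os

# Column names, assuming the common 16-column /proc/net/dev layout.
RX = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]
NAMES = ["rx_" + n for n in RX] + ["tx_" + n for n in TX]

# Counters that may point at link-level trouble (an illustrative choice).
SUSPECT = ("rx_errs", "rx_fifo", "rx_frame", "tx_errs", "tx_fifo", "tx_carrier")


def parse_net_dev(text):
    """Return {interface: {counter_name: int}} from /proc/net/dev contents."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        # Split on the colon first: the first number may follow it directly.
        iface, _, rest = line.partition(":")
        values = rest.split()
        if len(values) != len(NAMES):
            continue  # unexpected layout, skip rather than guess
        stats[iface.strip()] = dict(zip(NAMES, map(int, values)))
    return stats


def suspicious(stats):
    """Return only the interfaces that have nonzero suspect counters."""
    bad = {}
    for iface, counters in stats.items():
        nonzero = {k: counters[k] for k in SUSPECT if counters[k] > 0}
        if nonzero:
            bad[iface] = nonzero
    return bad


if __name__ == "__main__":
    if os.path.exists("/proc/net/dev"):
        with open("/proc/net/dev") as f:
            print(suspicious(parse_net_dev(f.read())))
```

Running this on every node before and after a long job and comparing the two outputs might show whether one particular link accumulates errors.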

