Yo folks
Does anyone have a suggestion as to what might be causing this? This is
with the 1.2.4 release, if that helps. We are testing the cluster, so
it could be a hardware problem - we just want to narrow it down if we
can. Any debugging suggestions would also be welcome.
Thanks
Ralph
Begin forwarded message:
From: Craig Idler <c...@lanl.gov>
Date: August 28, 2008 9:43:11 AM MDT
To: tlcc-inst...@lanl.gov
Cc: Trent D'Hooge <tdho...@llnl.gov>
Subject: error on QCD run
I've seen the following error a couple of times now during a QCD
multi-node run. Does this indicate an MPI driver issue or maybe an IB
network problem?
--------
Input file generated. Current time is: Thu Aug 28 00:38:47 2008 UTC
Starting executable preplat via "mpirun -np 512 ./preplat"
[0,1,452][btl_openib_component.c:1338:btl_openib_component_progress]
from loa126 to: loa119 error polling HP CQ with status LOCAL QP
OPERATION ERROR status number 2 for wr_id 141710328 opcode -1
mlx4: local QP operation err (QPN 8800ae, WQE index bfab0000, vendor
syndrome 6f, opcode = 5e)
mpirun noticed that job rank 0 with PID 10676 on node loa031 exited
on signal 15 (Terminated).
510 additional processes aborted (not shown)
mpirun finished with code 36608
--------
Thanks for any insight.
Craig
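
One possible way to narrow this down is to collect the output of several failed runs and count which nodes show up in the openib error lines; a node that keeps reappearing would be a reasonable first suspect for a hardware check. The following is only a rough sketch: the regular expression is modeled on the single error line quoted above and assumes that format holds for other failures, and the function name tally_hosts is made up for illustration. It also assumes the word "status" is intact in the log text.

```python
import re
from collections import Counter

# Pattern modeled on the one openib BTL error line quoted in this thread.
# Assumption: other failures are reported in the same format.
ERR_RE = re.compile(
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+error polling \w+ CQ "
    r"with status\s+(?P<status>.+?)\s+status number\s+(?P<num>\d+)"
)

def tally_hosts(log_text):
    """Count how often each host appears as source or destination of an error."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in ERR_RE.finditer(flat):
        counts[m.group("src")] += 1
        counts[m.group("dst")] += 1
    return counts

# Sample input: the error message from the run above.
sample = """[0,1,452][btl_openib_component.c:1338:btl_openib_component_progress]
from loa126 to: loa119 error polling HP CQ with status LOCAL QP
OPERATION ERROR status number 2 for wr_id 141710328 opcode -1"""

if __name__ == "__main__":
    # Print hosts ordered by how often they were involved in an error.
    for host, n in tally_hosts(sample).most_common():
        print(host, n)
```

With only one failure there is nothing to compare, so this is only useful once the output of several failed runs has been concatenated and passed in as log_text.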