I am not sure whether this is the correct forum, but we are experiencing
problems with InfiniBand (IB) when running a commercial CFD code (STAR-CCM+):
jobs crash with the errors below. Could someone explain the likely cause of
these errors and how we can minimise their occurrence?
Thanks,
Wayne
starccm+: Rank 0:172: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:172: MPI_Test: self cfd-cnsl-0230 peer cfd-cnsl-0144 (rank: 219)
starccm+: Rank 0:172: MPI_Test: error message: transport retry exceeded error
Error: {'In': ['Machine::main', 'SimulationIterator::startIterating',
'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate'], 'Neo.Error':
'Error', 'Processor': 172, 'Severity': 'EXCEPTION', 'message': 'MPI Error :
MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:71: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:68: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:71: MPI_Test: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 92)
starccm+: Rank 0:71: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:68: MPI_Test: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 93)
starccm+: Rank 0:68: MPI_Test: error message: transport retry exceeded error
Error: {'In': ['Machine::main', 'SimulationIterator::startIterating',
'SteadySolver::step', 'SegregatedFlowSolver::iterationUpdate',
'AMGLinearSolver::solve'], 'Neo.Error': 'Error', 'Processor': 71, 'Severity':
'EXCEPTION', 'message': 'MPI Error : MPI_Test: Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:68: MPI_Gather: ibv_poll_cq(): bad status 5
starccm+: Rank 0:68: MPI_Gather: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 93)
starccm+: Rank 0:68: MPI_Gather: error message: work request flushed error
starccm+: Rank 0:71: MPI_Gather: ibv_poll_cq(): bad status 12
starccm+: Rank 0:71: MPI_Gather: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 91)
starccm+: Rank 0:71: MPI_Gather: error message: transport retry exceeded error
/apps/CFD/CD-ADAPCO/Linux/starccm+3.04.008/star/bin/starenv: line 961: 5745 Segmentation fault "$@"
starccm+: Rank 0:118: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:46: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:42: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:118: MPI_Test: self cfd-cnsl-0408 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:118: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:42: MPI_Test: self cfd-cnsl-0271 peer cfd-cnsl-0452 (rank: 229)
starccm+: Rank 0:42: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:46: MPI_Test: self cfd-cnsl-0271 peer cfd-cnsl-0452 (rank: 228)
starccm+: Rank 0:46: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:86: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:87: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:93: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:244: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:244: MPI_Test: self cfd-cnsl-0342 peer cfd-cnsl-0257 (rank: 26)
starccm+: Rank 0:244: MPI_Test: error message: transport retry exceeded error
Error: {'In': ['Machine::main', 'SimulationIterator::startIterating',
'SteadySolver::step', 'RsTurbSolver::iterationUpdate'], 'Neo.Error': 'Error',
'Processor': 244, 'Severity': 'EXCEPTION', 'message': 'MPI Error : MPI_Test:
Internal MPI error'}
Synchronizing parallel nodes (attempt 0)
starccm+: Rank 0:26: MPI_Cancel: ibv_poll_cq(): bad status 12
starccm+: Rank 0:26: MPI_Cancel: self cfd-cnsl-0257 peer cfd-cnsl-0342 (rank: 244)
starccm+: Rank 0:26: MPI_Cancel: error message: transport retry exceeded error
starccm+: Rank 0:244: MPI_Cancel: ibv_poll_cq(): bad status 5
starccm+: Rank 0:244: MPI_Cancel: self cfd-cnsl-0342 peer cfd-cnsl-0257 (rank: 26)
starccm+: Rank 0:244: MPI_Cancel: error message: work request flushed error
starccm+: Rank 0:244: MPI_Cancel: MPI BUG: no requests done
/apps/CFD/CD-ADAPCO/Linux/starccm+3.04.008/star/bin/starenv: line 961: 5729 Segmentation fault "$@"
MPI Application rank 244 exited before MPI_Finalize() with status 139
hung
starccm+: Rank 0:58: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:57: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:57: MPI_Test: self cfd-cnsl-0401 peer cfd-cnsl-0448 (rank: 40)
starccm+: Rank 0:57: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:58: MPI_Test: self cfd-cnsl-0401 peer cfd-cnsl-0448 (rank: 42)
starccm+: Rank 0:58: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:72: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:72: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0277 (rank: 1)
starccm+: Rank 0:72: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:74: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:74: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0277 (rank: 1)
starccm+: Rank 0:74: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:75: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:75: MPI_Test: self cfd-cnsl-0371 peer cfd-cnsl-0448 (rank: 40)
starccm+: Rank 0:75: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:26: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:29: MPI_Test: self cfd-cnsl-0349 peer cfd-cnsl-0418 (rank: 252)
starccm+: Rank 0:29: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:26: MPI_Test: self cfd-cnsl-0349 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:26: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:134: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:129: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:135: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:131: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:130: MPI_Test: ibv_poll_cq(): bad status 12
starccm+: Rank 0:134: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 250)
starccm+: Rank 0:134: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:131: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 255)
starccm+: Rank 0:131: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:130: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:130: MPI_Test: error message: transport retry exceeded error
starccm+: Rank 0:129: MPI_Test: self cfd-cnsl-0386 peer cfd-cnsl-0418 (rank: 254)
starccm+: Rank 0:129: MPI_Test: error message: transport retry exceeded error
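
In case it helps anyone reading the output above, here is a minimal Python
sketch for tallying which hosts are named as the peer in these errors and how
often each ibv_poll_cq status code appears. It assumes the line formats are
exactly as quoted above; the names tally, report, STATUS_RE, PEER_RE and
STATUS_TEXT are made up for this illustration and are not part of STAR-CCM+ or
the MPI library. The status descriptions are copied from the "error message:"
lines in the output, not from any other source.

```python
import re
from collections import Counter

# Line formats are assumed to match the job output quoted above, e.g.
#   starccm+: Rank 0:71: MPI_Test: ibv_poll_cq(): bad status 12
#   starccm+: Rank 0:71: MPI_Test: self cfd-cnsl-0196 peer cfd-cnsl-0214 (rank: 92)
STATUS_RE = re.compile(r"ibv_poll_cq\(\): bad status (\d+)")
PEER_RE = re.compile(r"self (\S+) peer (\S+)")

# Status text copied from the "error message:" lines in the output above.
STATUS_TEXT = {
    5: "work request flushed error",
    12: "transport retry exceeded error",
}


def tally(log_text):
    """Return (peer_counts, pair_counts, status_counts) for a block of job output."""
    peer_counts = Counter()    # how often each host is named as the peer
    pair_counts = Counter()    # how often each (self, peer) host pair occurs
    status_counts = Counter()  # how often each ibv_poll_cq status code occurs
    for line in log_text.splitlines():
        status = STATUS_RE.search(line)
        if status:
            status_counts[int(status.group(1))] += 1
        peer = PEER_RE.search(line)
        if peer:
            self_host, peer_host = peer.groups()
            peer_counts[peer_host] += 1
            pair_counts[(self_host, peer_host)] += 1
    return peer_counts, pair_counts, status_counts


def report(log_text):
    """Print the tallies, most frequent first."""
    peer_counts, pair_counts, status_counts = tally(log_text)
    for code, n in status_counts.most_common():
        print(f"status {code} ({STATUS_TEXT.get(code, 'unknown')}): {n}")
    for host, n in peer_counts.most_common():
        print(f"peer {host}: {n}")
    for (self_host, peer_host), n in pair_counts.most_common():
        print(f"{self_host} -> {peer_host}: {n}")
```

To use it, pass the saved job output as a string, for example
report(open("job.log").read()), where job.log is a hypothetical file name. A
host that shows up as the peer for many different reporting nodes may be worth
checking first, although the tally on its own does not identify the cause.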