On May 5, 2008, at 6:27 PM, Steve Wise wrote:
I am seeing some unusual behavior during the shutdown phase of ompi
at the end of my testcase. While running an IMB pingpong test over
the rdmacm on openib, I get cq flush errors on my iWARP adapters.
This error is happening because the remote node is still polling
the endpoint while the other one has shut down. This occurs because
iWARP puts the qps in error state when the channel is disconnected
(IB does not do this). Since the cq is still being polled when the
event is received on the remote node, ompi thinks it hit an error
and kills the run. Since this is expected behavior on iWARP, this
is not really an error case.
The key here, I think, is that when an iWARP QP moves out of RTS, all
the RECVs and any pending SQ WRs get flushed. Further, disconnecting
the iWARP connection forces the QP out of RTS. This is probably
different from the way IB works. I.e., "disconnecting" in IB is an
out-of-band exchange done by the IBCM. For iWARP, "disconnecting" is
an in-band operation (a TCP close or abort), so the QP cannot remain
in RTS during this process.
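A toy model may help pin down the behavior described above. It is only
a sketch: the toy_* types and functions are hypothetical stand-ins, not
libibverbs or OMPI code, and the IB branch simply encodes the
assumption stated above that an IB disconnect is out-of-band and leaves
the QP in RTS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT libibverbs types. */
enum toy_transport { TOY_IB, TOY_IWARP };
enum toy_qp_state  { TOY_QP_RTS, TOY_QP_ERROR };

struct toy_qp {
    enum toy_transport transport;
    enum toy_qp_state  state;
    int pending_recvs;  /* posted RECVs not yet completed */
    int pending_sends;  /* SQ WRs not yet completed */
    int flushed;        /* flush completions generated so far */
};

/* Leaving RTS flushes every outstanding RECV and SQ WR. */
void toy_qp_leave_rts(struct toy_qp *qp)
{
    assert(qp != NULL);
    if (qp->state != TOY_QP_RTS)
        return;  /* already out of RTS, nothing left to flush */
    qp->flushed += qp->pending_recvs + qp->pending_sends;
    qp->pending_recvs = 0;
    qp->pending_sends = 0;
    qp->state = TOY_QP_ERROR;
}

/*
 * Model of a disconnect as seen by one side of the connection.
 * iWARP: in-band (TCP close or abort), so the QP is forced out of RTS.
 * IB:    assumed out-of-band (IBCM exchange), so the QP stays in RTS.
 * Returns the number of flush completions this disconnect produced.
 */
int toy_disconnect(struct toy_qp *qp)
{
    assert(qp != NULL);
    int before = qp->flushed;
    if (qp->transport == TOY_IWARP)
        toy_qp_leave_rts(qp);
    return qp->flushed - before;
}
```

In this model, a side that still has RECVs posted and keeps polling its
CQ will see one flushed completion per outstanding WR after an iWARP
disconnect, and none after an IB disconnect.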
Let me make sure I understand:
- proc A calls del_procs on proc B
- proc A calls ibv_destroy_qp() on QP to proc B
- this causes a local (proc A) flush on all pending receives and SQ WRs
- this then causes a FLUSH event to show up *in proc B*
--> I'm not clear on this point from Jon's/Steve's text
- OMPI [currently] treats the FLUSH in proc B as an error
Is that right?
What is the purpose of the FLUSH event?
There is a larger question regarding why the remote node is still
polling the HCA and not shutting down, but my immediate question is
whether it is an acceptable fix to simply disregard this "error" if
it is an iWARP adapter.
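For concreteness, the proposed fix could look roughly like the sketch
below. This is not the openib btl code: the sketch_* names are
hypothetical, and SKETCH_WC_WR_FLUSH_ERR merely stands in for the
IBV_WC_WR_FLUSH_ERR status that ibv_poll_cq() reports for flushed WRs.
It ignores flush completions on iWARP unconditionally, which is the
simplest reading of the proposal; a stricter variant might ignore them
only once shutdown has begun.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT libibverbs types. */
enum sketch_transport { SKETCH_IB, SKETCH_IWARP };

enum sketch_wc_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_WR_FLUSH_ERR,  /* stands in for IBV_WC_WR_FLUSH_ERR */
    SKETCH_WC_OTHER_ERR      /* any other non-success status */
};

enum sketch_action { SKETCH_HANDLE, SKETCH_IGNORE, SKETCH_FATAL };

/* Decide what to do with one polled completion. */
enum sketch_action sketch_classify_wc(enum sketch_transport transport,
                                      enum sketch_wc_status status)
{
    if (status == SKETCH_WC_SUCCESS)
        return SKETCH_HANDLE;
    /* Flushed WRs are expected on iWARP once the peer disconnects. */
    if (status == SKETCH_WC_WR_FLUSH_ERR && transport == SKETCH_IWARP)
        return SKETCH_IGNORE;
    return SKETCH_FATAL;
}

/* Count how many completions in a polled batch would abort the run. */
int sketch_count_fatal(enum sketch_transport transport,
                       const enum sketch_wc_status *statuses, size_t n)
{
    assert(statuses != NULL || n == 0);
    int fatal = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_classify_wc(transport, statuses[i]) == SKETCH_FATAL)
            fatal++;
    }
    return fatal;
}
```

With this classification, the same batch of flushed completions would
still kill the run on IB but would be silently dropped on iWARP.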
If proc B is still polling the HCA, it is likely because it simply has
not yet stopped doing it. I.e., a big problem in MPI implementations
is that not all actions are exactly synchronous. MPI disconnects are
*effectively* synchronous, but we probably didn't *guarantee*
synchronicity in this case because we didn't need it (perhaps until
now).
Opinions?
If the openib btl (or the layers above) assume the "disconnect" will
notify the remote rank that the connection should be finalized, then
we must deal with FLUSHED WRs for the iWARP case. If some sort of
"finalizing" is done by OMPI and then the connections are
disconnected, then that "finalizing" should include not polling the
CQ anymore. But that's not what we observe.
I'd have to check the exact shutdown sequence...
--
Jeff Squyres
Cisco Systems