Jeff Squyres wrote:
Background: Pasha added a call to ibv_dealloc_pd() in the openib BTL
finalize function; that call only succeeds if all registered memory
has been released. Since the test app didn't call MPI_FREE_MEM,
there was some memory that was still registered, and therefore the
call in finalize failed. We treated this as a fatal error. Last
night's MTT runs turned up several apps that exhibited this fatal error.
While we're examining this problem, Pasha has removed the call to
ibv_dealloc_pd() in the trunk openib BTL finalize.
I examined one of the tests that was failing last night in MTT:
onesided/t.f90. This test has an MPI_ALLOC_MEM with no corresponding
MPI_FREE_MEM. To investigate this problem, I restored the call to
ibv_dealloc_pd() and re-ran the t.f90 test -- the problem still
occurs. Good.
However, once I added the matching MPI_FREE_MEM call to t.f90, the
test started passing. I.e., ibv_dealloc_pd(hca->ib_pd) succeeds because
all registered memory has been released. Hence, the test itself was
faulty.
However, I don't think we should *error* if we fail to
ibv_dealloc_pd(hca->ib_pd); it's a user error, but it's not
catastrophic unless we're trying to do an HCA restart scenario.
Specifically: during a normal MPI_FINALIZE, who cares?
I think we should do the following:
1. If we're not doing an HCA restart/checkpoint and we fail to
ibv_dealloc_pd(), just move on (i.e., it's not a warning/error unless
we *want* a warning, such as if an MCA parameter
btl_openib_warn_if_finalize_fail is enabled, or somesuch).
2. If we *are* doing an HCA restart/checkpoint and ibv_dealloc_pd()
fails, then we have to fail gracefully and notify upper layers that
Bad Things happened (I suspect that we need mpool finalize
implemented to properly implement checkpointing for RDMA networks).
3. Add a new MCA parameter named mpi_show_mpi_alloc_mem_leaks that,
when enabled, shows a warning in ompi_mpi_finalize() if there is
still memory allocated by MPI_ALLOC_MEM that was not freed by
MPI_FREE_MEM (this MCA parameter will parallel the already-existing
mpi_show_handle_leaks MCA param which displays warnings if the app
creates MPI objects but does not free them).
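
To make item 3 concrete, here is a rough sketch of the bookkeeping
such a parameter could drive. This is not Open MPI code: every name
below (tracked_alloc, tracked_free, report_leaks_at_finalize,
show_alloc_mem_leaks) is hypothetical, and plain malloc/free stand in
for the real MPI_ALLOC_MEM / MPI_FREE_MEM paths.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bookkeeping; none of these names are Open MPI symbols. */
static size_t outstanding_allocs = 0; /* allocations not yet freed */
static int show_alloc_mem_leaks = 1;  /* stands in for the proposed MCA param */

/* Stand-in for the MPI_ALLOC_MEM path: allocate and count. */
static void *tracked_alloc(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL) {
        outstanding_allocs++;
    }
    return p;
}

/* Stand-in for the MPI_FREE_MEM path: free and uncount. */
static void tracked_free(void *p)
{
    assert(p != NULL && outstanding_allocs > 0);
    free(p);
    outstanding_allocs--;
}

/* Stand-in for the check in finalize; returns the number of leaks. */
static size_t report_leaks_at_finalize(void)
{
    if (show_alloc_mem_leaks && outstanding_allocs > 0) {
        fprintf(stderr,
                "warning: %zu MPI_ALLOC_MEM allocation(s) never freed\n",
                outstanding_allocs);
    }
    return outstanding_allocs;
}
```

The point of the sketch is only that the count lives at the MPI layer,
so the warning does not depend on which BTL or mpool is in use.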
My points:
- leaked MPI_ALLOC_MEM memory should be reported by the MPI layer,
not a BTL or mpool
- failing to ibv_dealloc_pd() during MPI_FINALIZE should only trigger
a warning if the user wants to see it
- failing to ibv_dealloc_pd() during an HCA restart or checkpoint
should gracefully fail upwards
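
The last two points boil down to a small decision table, sketched
below. It is an illustration under assumptions, not the openib BTL's
actual code: finalize_policy, the result codes, and the two flags are
hypothetical names. The only fact relied on is that ibv_dealloc_pd()
returns 0 on success and nonzero on failure, so its return value can
be passed in as dealloc_rc.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result codes; not Open MPI constants. */
enum finalize_result { FINALIZE_OK = 0, FINALIZE_ERROR = -1 };

/*
 * dealloc_rc:      return value of the protection-domain release
 *                  (nonzero means it failed, e.g. memory still registered)
 * restarting:      true during an HCA restart or checkpoint
 * warn_on_failure: stands in for the proposed warn-if-finalize-fail param
 */
static enum finalize_result
finalize_policy(int dealloc_rc, bool restarting, bool warn_on_failure)
{
    if (dealloc_rc == 0) {
        return FINALIZE_OK;          /* everything was released */
    }
    if (restarting) {
        /* Upper layers must learn that the restart cannot proceed. */
        return FINALIZE_ERROR;
    }
    if (warn_on_failure) {
        fprintf(stderr, "warning: could not release protection domain "
                        "(rc=%d); memory still registered?\n", dealloc_rc);
    }
    return FINALIZE_OK;              /* normal MPI_FINALIZE: just move on */
}
```

For example, finalize_policy(rc, false, false) stays silent and
reports success during a normal MPI_FINALIZE, while
finalize_policy(rc, true, false) with a nonzero rc reports an error
to the caller.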
Comments?
Agree.
In addition, I will add code that flushes all user data from the mpool
so that normal IB finalization can proceed.