What: Completely revamp the BTL RDMA interface (btl_put, btl_get) to better match what is needed for MPI one-sided.
Why: I am preparing to push an enhanced MPI-3 one-sided component that uses network rdma and atomic operations to provide a fast, truly one-sided implementation. Before I can push this component I want to change the btl interface to:

- Provide access to network atomic operations. I only need add and cswap, but the interface can be extended to any number of operations. The new interface provides three new functions: btl_atomic_op, btl_atomic_fop, and btl_atomic_cswap. Additionally, there are two new btl_flags to indicate available atomic support: MCA_BTL_FLAGS_ATOMIC_OPS and MCA_BTL_FLAGS_ATOMIC_FOPS. The btl_atomics_flags field has been added to indicate which atomic operations are supported (see mca_btl_base_atomic_op_t). At this time I have only added support for 64-bit integer atomics, but I am open to adding support for 32-bit as well.

- Provide an interface that will allow simultaneous put/get operations without extra calls into the btl. The current interface requires the btl user to call prepare_src/prepare_dst before every rdma operation. In some cases this is a complete waste (vader, sm with CMA, knem, or xpmem). I separated the registration of memory from the segment info. More information is provided below.
  The new put/get functions have the following signatures:

    typedef int (*mca_btl_base_module_put_fn_t) (
        struct mca_btl_base_module_t *btl,
        struct mca_btl_base_endpoint_t *endpoint,
        void *local_address, uint64_t remote_address,
        struct mca_btl_base_registration_handle_t *local_handle,
        struct mca_btl_base_registration_handle_t *remote_handle,
        size_t size, int flags, int order,
        mca_btl_base_rdma_completion_fn_t cbfunc,
        void *cbcontext, void *cbdata);

    typedef int (*mca_btl_base_module_get_fn_t) (
        struct mca_btl_base_module_t *btl,
        struct mca_btl_base_endpoint_t *endpoint,
        void *local_address, uint64_t remote_address,
        struct mca_btl_base_registration_handle_t *local_handle,
        struct mca_btl_base_registration_handle_t *remote_handle,
        size_t size, int flags, int order,
        mca_btl_base_rdma_completion_fn_t cbfunc,
        void *cbcontext, void *cbdata);

    typedef void (*mca_btl_base_rdma_completion_fn_t)(
        struct mca_btl_base_module_t* module,
        struct mca_btl_base_endpoint_t* endpoint,
        void *local_address,
        struct mca_btl_base_registration_handle_t *local_handle,
        void *context, void *cbdata, int status);

  I may modify the completion function to provide more information on the completed operation (size).

- Allow the registration of an entire region even if the region cannot be modified with a single rdma operation. At this time prepare_src and prepare_dst may modify the size and register a smaller region. This will not work. This is done in the new interface through the new btl_register_mem and btl_deregister_mem interfaces. The btl_register_mem interface returns a registration handle of size btl_registration_handle_size that can be used as either the local_handle or remote_handle to any rdma/atomic function. BTLs that do not provide these functions do not require registration for rdma/atomic operations.

    typedef struct mca_btl_base_registration_handle_t *(*mca_btl_base_module_register_mem_fn_t)(
        struct mca_btl_base_module_t* btl,
        struct mca_btl_base_endpoint_t *endpoint,
        void *base, size_t size, uint32_t flags);

- Expose the limitations of the put and get operations so the caller can make decisions before trying a get or put operation. Two examples: the Gemini interconnect has an alignment restriction on get, and openib devices may have a limit on how large a single get/put operation can be. The current interface partially exposes the put limit, but it is tied to the rdma pipeline protocol. This is done in the new interface by providing btl_get_limit, btl_get_alignment, btl_put_limit, and btl_put_alignment. Operations that violate these restrictions should return OPAL_ERR_BAD_PARAM (operation over limit) or OPAL_ERR_NOT_SUPPORTED (operation not supported due to alignment restrictions on either the source or destination buffer).

This is a big change and I do not expect everyone to like 100% of these changes. I welcome any feedback people have.

When: Tuesday, Nov 17, 2015. This is during SC so there will be time for face-to-face discussion if anyone has any concerns or would like to see something changed.

The proposed new btl interface as well as updated versions of pml/ob1, btl/openib, btl/self, btl/scif, btl/sm, btl/tcp, btl/ugni, and btl/vader can be found in my btlmod branch at:

https://github.com/hjelmn/ompi/tree/btlmod

Other btls (smcuda and usnic) still need to be updated to provide the new interface. Unmodified btls will not build.

If there are no objections I will push the btl modifications into master two weeks from today (Nov 17). Please take a look and let me know what you think.
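
To illustrate how a caller might use the advertised put limit and alignment before issuing operations, here is a rough sketch. Everything prefixed with sketch_ or SKETCH_ is a hypothetical stand-in invented for this example: the struct only mirrors the btl_put_limit and btl_put_alignment fields named above, the status codes are placeholders for the OPAL_* values, and the callback stands in for a real btl_put call. It also assumes that an alignment of 0 means "no restriction" and that the limit is a multiple of the alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the limit fields described in this RFC. */
typedef struct sketch_btl_limits {
    size_t put_limit;      /* largest single put in bytes; 0 = unlimited (assumption) */
    size_t put_alignment;  /* required alignment in bytes; 0 = none (assumption) */
} sketch_btl_limits_t;

/* Placeholder status codes; the real OPAL_* values are not reproduced here. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_SUPPORTED = -1 };

/* Hypothetical callback standing in for one btl_put call on one chunk. */
typedef int (*sketch_put_chunk_fn_t)(void *local, uint64_t remote,
                                     size_t size, void *ctx);

/* Example callback that only records what it was asked to transfer. */
typedef struct sketch_counter {
    size_t calls;
    size_t bytes;
} sketch_counter_t;

static int sketch_count_chunk(void *local, uint64_t remote, size_t size, void *ctx)
{
    sketch_counter_t *counter = (sketch_counter_t *) ctx;
    (void) local;
    (void) remote;
    counter->calls += 1;
    counter->bytes += size;
    return SKETCH_SUCCESS;
}

static int sketch_is_aligned(uint64_t value, size_t alignment)
{
    return alignment == 0 || (value % alignment) == 0;
}

/* Split one logical put into chunks no larger than put_limit. */
static int sketch_put_chunked(const sketch_btl_limits_t *limits, void *local,
                              uint64_t remote, size_t size,
                              sketch_put_chunk_fn_t put_chunk, void *ctx)
{
    char *cursor = (char *) local;

    /* Check both buffers up front rather than letting the put fail. */
    if (!sketch_is_aligned((uint64_t) (uintptr_t) local, limits->put_alignment) ||
        !sketch_is_aligned(remote, limits->put_alignment)) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }

    while (size > 0) {
        size_t chunk = size;
        int rc;

        if (limits->put_limit != 0 && chunk > limits->put_limit) {
            chunk = limits->put_limit;
        }

        rc = put_chunk(cursor, remote, chunk, ctx);
        if (rc != SKETCH_SUCCESS) {
            return rc;
        }

        cursor += chunk;
        remote += chunk;
        size -= chunk;
    }

    return SKETCH_SUCCESS;
}
```

With a 4096-byte limit, a 10000-byte put would be issued as three chunks (4096, 4096, and 1808 bytes). Because registration is now separate from the segment info, the same handles could be reused for every chunk.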
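
Similarly, a btl user would probably test the new flags before choosing an atomic code path. The sketch below shows one way that check could look. The bit values and the SKETCH_ names are placeholders made up for this example (the real MCA_BTL_FLAGS_ATOMIC_OPS and MCA_BTL_FLAGS_ATOMIC_FOPS values and the mca_btl_base_atomic_op_t encoding are defined in the branch and not reproduced here), and the struct is a minimal stand-in for the relevant module fields.

```c
#include <stdint.h>

/* Placeholder bits standing in for MCA_BTL_FLAGS_ATOMIC_OPS and
 * MCA_BTL_FLAGS_ATOMIC_FOPS; the real values are not assumed here. */
#define SKETCH_FLAGS_ATOMIC_OPS   0x1u
#define SKETCH_FLAGS_ATOMIC_FOPS  0x2u

/* Placeholder per-operation bits standing in for mca_btl_base_atomic_op_t. */
#define SKETCH_ATOMIC_SUPPORTS_ADD    0x1u
#define SKETCH_ATOMIC_SUPPORTS_CSWAP  0x2u

/* Hypothetical stand-in for the two capability fields on a btl module. */
typedef struct sketch_btl_caps {
    uint32_t btl_flags;          /* general capability flags */
    uint32_t btl_atomics_flags;  /* which atomic operations are supported */
} sketch_btl_caps_t;

/* Non-zero if a fetching add could be offloaded to the network. */
static int sketch_can_fetch_add(const sketch_btl_caps_t *caps)
{
    return (caps->btl_flags & SKETCH_FLAGS_ATOMIC_FOPS) != 0 &&
           (caps->btl_atomics_flags & SKETCH_ATOMIC_SUPPORTS_ADD) != 0;
}

/* Non-zero if a non-fetching add could be offloaded to the network. */
static int sketch_can_add(const sketch_btl_caps_t *caps)
{
    return (caps->btl_flags & SKETCH_FLAGS_ATOMIC_OPS) != 0 &&
           (caps->btl_atomics_flags & SKETCH_ATOMIC_SUPPORTS_ADD) != 0;
}
```

A one-sided component could fall back to an active-message or lock-based path whenever these checks return zero.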