What: Completely revamp the BTL RDMA interface (btl_put, btl_get) to
better match what is needed for MPI one-sided.

Why: I am preparing to push an enhanced MPI-3 one-sided component that
makes use of network rdma and atomic operations to provide a fast, truly
one-sided implementation. Before I can push this component I want to
change the btl interface to:

 - Provide access to network atomic operations. I only need add and
   cswap but the interface can be extended to any number of operations.

   The new interface provides three new functions: btl_atomic_op,
   btl_atomic_fop, and btl_atomic_cswap. Additionally there are two new
   btl_flags to indicate available atomic support:
   MCA_BTL_FLAGS_ATOMIC_OPS and MCA_BTL_FLAGS_ATOMIC_FOPS. The
   btl_atomics_flags field has been added to indicate which atomic
   operations are supported (see mca_btl_base_atomic_op_t). At this time
   I only added support for 64-bit integer atomics but I am open to
   adding support for 32-bit as well.
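
   As a rough illustration of how a btl user might test for atomic
   support before relying on it, here is a minimal sketch. It is not
   taken from the btlmod branch: the struct is a reduced stand-in, the
   flag values are placeholders, and the per-operation bit
   SKETCH_ATOMIC_SUPPORTS_ADD is a hypothetical name, since this RFC
   does not list the members of mca_btl_base_atomic_op_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values; the real values are defined in the btl header. */
#define SKETCH_BTL_FLAGS_ATOMIC_OPS   0x1u  /* stands in for MCA_BTL_FLAGS_ATOMIC_OPS */
#define SKETCH_BTL_FLAGS_ATOMIC_FOPS  0x2u  /* stands in for MCA_BTL_FLAGS_ATOMIC_FOPS */

/* Hypothetical per-operation bit stored in btl_atomics_flags. */
#define SKETCH_ATOMIC_SUPPORTS_ADD    0x1u

/* Reduced stand-in for the btl module: only the two fields used here. */
struct sketch_btl_module {
    uint32_t btl_flags;          /* general capability flags */
    uint32_t btl_atomics_flags;  /* which atomic operations are supported */
};

/* True if a non-fetching atomic add (btl_atomic_op) should be usable. */
static bool sketch_can_atomic_add(const struct sketch_btl_module *btl)
{
    return (btl->btl_flags & SKETCH_BTL_FLAGS_ATOMIC_OPS) &&
           (btl->btl_atomics_flags & SKETCH_ATOMIC_SUPPORTS_ADD);
}

/* True if a fetching atomic add (btl_atomic_fop) should be usable. */
static bool sketch_can_fetch_add(const struct sketch_btl_module *btl)
{
    return (btl->btl_flags & SKETCH_BTL_FLAGS_ATOMIC_FOPS) &&
           (btl->btl_atomics_flags & SKETCH_ATOMIC_SUPPORTS_ADD);
}
```

   A caller that finds neither check true would have to fall back to
   some non-atomic path, for example one built on active messages.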


 - Provide an interface that will allow simultaneous put/get operations
   without extra calls into the btl. The current interface requires the
   btl user to call prepare_src/prepare_dst before every rdma
   operation. In some cases this is a complete waste (vader, sm with
   CMA, knem, or xpmem).

   I separated the registration of memory from the segment info. More
   information is provided below. The new put/get functions have the
   following signatures:


typedef int (*mca_btl_base_module_put_fn_t) (struct mca_btl_base_module_t *btl,
    struct mca_btl_base_endpoint_t *endpoint, void *local_address,
    uint64_t remote_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    struct mca_btl_base_registration_handle_t *remote_handle,
    size_t size, int flags, int order,
    mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);

typedef int (*mca_btl_base_module_get_fn_t) (struct mca_btl_base_module_t *btl,
    struct mca_btl_base_endpoint_t *endpoint, void *local_address,
    uint64_t remote_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    struct mca_btl_base_registration_handle_t *remote_handle,
    size_t size, int flags, int order,
    mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);

typedef void (*mca_btl_base_rdma_completion_fn_t)(
    struct mca_btl_base_module_t* module,
    struct mca_btl_base_endpoint_t* endpoint,
    void *local_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    void *context,
    void *cbdata,
    int status);

   I may modify the completion function to provide more information on
   the completed operation (size).
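
   To make the calling convention concrete, here is a minimal sketch of
   a put that matches the signature above for the simplest possible
   case, where the "remote" buffer is directly addressable and no
   registration is needed. The struct bodies are empty stand-ins, the
   sketch_ functions are hypothetical, and status 0 is assumed to mean
   success; none of this is code from the btlmod branch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Empty stand-ins for the real btl types. */
struct mca_btl_base_module_t { int unused; };
struct mca_btl_base_endpoint_t { int unused; };
struct mca_btl_base_registration_handle_t { int unused; };

typedef void (*mca_btl_base_rdma_completion_fn_t)(
    struct mca_btl_base_module_t *module,
    struct mca_btl_base_endpoint_t *endpoint,
    void *local_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    void *context, void *cbdata, int status);

typedef int (*mca_btl_base_module_put_fn_t)(struct mca_btl_base_module_t *btl,
    struct mca_btl_base_endpoint_t *endpoint, void *local_address,
    uint64_t remote_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    struct mca_btl_base_registration_handle_t *remote_handle,
    size_t size, int flags, int order,
    mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);

/* Toy put: copy directly, then report completion through the callback. */
static int sketch_put(struct mca_btl_base_module_t *btl,
    struct mca_btl_base_endpoint_t *endpoint, void *local_address,
    uint64_t remote_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    struct mca_btl_base_registration_handle_t *remote_handle,
    size_t size, int flags, int order,
    mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata)
{
    (void) remote_handle; (void) flags; (void) order;
    memcpy((void *) (uintptr_t) remote_address, local_address, size);
    if (NULL != cbfunc) {
        /* 0 is assumed to be the success status */
        cbfunc(btl, endpoint, local_address, local_handle, cbcontext, cbdata, 0);
    }
    return 0;
}

/* Example callback: counts successful completions in *(int *) context. */
static void sketch_count_completion(struct mca_btl_base_module_t *module,
    struct mca_btl_base_endpoint_t *endpoint, void *local_address,
    struct mca_btl_base_registration_handle_t *local_handle,
    void *context, void *cbdata, int status)
{
    (void) module; (void) endpoint; (void) local_address;
    (void) local_handle; (void) cbdata;
    if (0 == status) {
        ++*(int *) context;
    }
}
```

   Note that the caller passes the buffer, the handles, and the
   callback in a single call, with no prepare_src/prepare_dst step
   beforehand.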


 - Allow the registration of an entire region even if the region can not
   be modified with a single rdma operation. At this time prepare_src
   and prepare_dst may modify the size and register a smaller region
   than requested, so they can not be used to register a whole region.

   This is done in the new interface through the new btl_register_mem
   and btl_deregister_mem interfaces. The btl_register_mem interface
   returns a registration handle of size btl_registration_handle_size
   that can be used as either the local_handle or remote_handle to any
   rdma/atomic function. BTLs that do not provide these functions do not
   require registration for rdma/atomic operations.

typedef struct mca_btl_base_registration_handle_t *(*mca_btl_base_module_register_mem_fn_t)(
    struct mca_btl_base_module_t* btl, struct mca_btl_base_endpoint_t *endpoint,
    void *base, size_t size, uint32_t flags);
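
   A caller that wants to register a whole region once might use the
   registration function roughly as follows. This is a sketch under
   stated assumptions: the module struct is reduced to two fields, the
   handle contents and sketch_toy_register are invented for the
   example, passing 0 for flags is a placeholder, and a NULL
   btl_register_mem is taken to mean that no registration is required.

```c
#include <stddef.h>
#include <stdint.h>

struct mca_btl_base_endpoint_t;
struct mca_btl_base_module_t;

/* Invented handle contents; real handles are opaque, btl-specific data. */
struct mca_btl_base_registration_handle_t { void *base; size_t size; };

typedef struct mca_btl_base_registration_handle_t *(*mca_btl_base_module_register_mem_fn_t)(
    struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
    void *base, size_t size, uint32_t flags);

/* Reduced stand-in for the btl module. */
struct mca_btl_base_module_t {
    size_t btl_registration_handle_size;
    mca_btl_base_module_register_mem_fn_t btl_register_mem; /* NULL: not needed */
};

/* Register [base, base + size) once. On success *handle is either a valid
 * handle or NULL (btl needs no registration). Returns 0 or -1. */
static int sketch_register_region(struct mca_btl_base_module_t *btl,
    struct mca_btl_base_endpoint_t *endpoint, void *base, size_t size,
    struct mca_btl_base_registration_handle_t **handle)
{
    *handle = NULL;
    if (NULL == btl->btl_register_mem) {
        return 0;
    }
    *handle = btl->btl_register_mem(btl, endpoint, base, size, 0 /* placeholder flags */);
    return (NULL != *handle) ? 0 : -1;
}

/* Toy registration function that just records the region it was given. */
static struct mca_btl_base_registration_handle_t sketch_handle;

static struct mca_btl_base_registration_handle_t *sketch_toy_register(
    struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
    void *base, size_t size, uint32_t flags)
{
    (void) btl; (void) endpoint; (void) flags;
    sketch_handle.base = base;
    sketch_handle.size = size;
    return &sketch_handle;
}
```

   The handle returned here would then be passed as local_handle to
   put/get, and its first btl_registration_handle_size bytes sent to the
   peer for use as remote_handle.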


 - Expose the limitations of the put and get operations so the caller
   can make decisions before trying a get or put operation. Two
   examples: the Gemini interconnect has an alignment restriction on
   get, openib devices may have a limit on how large a single get/put
   operation can be. The current interface sort of gives the put limit
   but it is tied to the rdma pipeline protocol.

   This is done in the new interface by providing btl_get_limit,
   btl_get_alignment, btl_put_limit, and btl_put_alignment. Operations
   that violate these restrictions should return OPAL_ERR_BAD_PARAM
   (operation over limit) or OPAL_ERR_NOT_SUPPORTED (operation not
   supported due to alignment restrictions with either the source or
   destination buffer).
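
   The kind of check a caller could make before issuing a get can be
   sketched as below. The struct is a reduced stand-in holding only the
   two get fields, the SKETCH_ error values are placeholders for the
   OPAL codes named above, and the sketch assumes that an alignment of
   0 or 1 means "no restriction" and that the limit is a nonzero byte
   count.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder values standing in for OPAL_SUCCESS, OPAL_ERR_BAD_PARAM,
 * and OPAL_ERR_NOT_SUPPORTED. */
enum {
    SKETCH_SUCCESS           =  0,
    SKETCH_ERR_BAD_PARAM     = -1,
    SKETCH_ERR_NOT_SUPPORTED = -2
};

/* Reduced stand-in for the btl module: only the get restrictions. */
struct sketch_btl_limits {
    size_t btl_get_limit;      /* largest single get, in bytes (assumed nonzero) */
    size_t btl_get_alignment;  /* required address alignment; 0 or 1 = none */
};

/* Return the error a get with these arguments would be expected to produce. */
static int sketch_check_get(const struct sketch_btl_limits *btl,
                            const void *local_address,
                            uint64_t remote_address, size_t size)
{
    size_t align = btl->btl_get_alignment;

    if (size > btl->btl_get_limit) {
        return SKETCH_ERR_BAD_PARAM;      /* operation over limit */
    }
    if (align > 1 && (0 != (uintptr_t) local_address % align ||
                      0 != remote_address % align)) {
        return SKETCH_ERR_NOT_SUPPORTED;  /* source or destination misaligned */
    }
    return SKETCH_SUCCESS;
}

/* Largest chunk that stays under the limit and keeps every chunk start
 * aligned when a large transfer is split into several gets. */
static size_t sketch_get_chunk_size(const struct sketch_btl_limits *btl)
{
    size_t align = btl->btl_get_alignment;

    if (align <= 1) {
        return btl->btl_get_limit;
    }
    return btl->btl_get_limit - (btl->btl_get_limit % align);
}
```

   The same pattern would apply to btl_put_limit and btl_put_alignment.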

This is a big change and I do not expect everyone to like 100% of these
changes. I welcome any feedback people have.


When: Tuesday, Nov 17, 2015. This is during SC so there will be time for
face-to-face discussion if anyone has any concerns or would like to see
something changed.



The proposed new btl interface as well as updated versions of: pml/ob1,
btl/openib, btl/self, btl/scif, btl/sm, btl/tcp, btl/ugni, and btl/vader
can be found in my btlmod branch at:

https://github.com/hjelmn/ompi/tree/btlmod

Other btls (smcuda and usnic) still need to be updated to provide the
new interface. Unmodified btls will not build.

If there are no objections I will push the btl modifications into
master two weeks from today (Nov 17). Please take a look and let me know
what you think.
