Soliciting input from the community:
WHAT: Modify the PML cm component to remove unnecessary initializations, optimizing blocking operations.

WHY: Removing overhead from the fast path by allowing a "direct mode" reduces single-packet latency.

HOW: In PML cm, even if a request starts and ends within the scope of the blocking send/recv function, a full request may be initialized: a structure of up to 488 bytes (not including the MTL request appendix size). The request includes the ompi_request_t structure (used by the underlying MTL component), the converter that corresponds to the datatype, and other parameters, some of which are stored and only used if the request is asynchronous. This causes a significant number of writes, especially considering that the send buffer could be as small as a few bytes.

The proposed patch introduces a "direct mode" (currently set iff the underlying MTL is "mxm", which is the only option I had available for testing). When enabled, it cuts most of the initialization for blocking send and receive operations down to the bare minimum required to function. Aside from initializing only part of the request structure (fields like "dst" and "tag" are passed again to the MTL_CALL macro rather than read from the request struct anyway), the function uses a single pre-allocated request buffer, which is possible since the call is blocking. Our tests show that this increases the packet rate by approximately 20% with 8-byte buffers.

Note that the "redundant" if-conditions for irrelevant functions (e.g. recv_init) are removed by the compiler, since the macro is substituted and yields "if (0 == 0)".

WHERE: Most of the files in ompi/mca/pml/cm.

WHEN: ?

Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies
Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898
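
For readers who want a concrete picture of the HOW section, here is a minimal sketch of the direct-mode idea in C. It is not the patch itself: every name in it (sketch_request_t, SKETCH_DIRECT_MODE, sketch_blocking_send, and so on) is hypothetical, the struct layout only loosely imitates a large request, and the "transport" is a stub that completes immediately. It shows a blocking send that, in direct mode, writes only the few fields it needs into a single pre-allocated request, and a mode check that is a compile-time constant so the compiler can drop the unused branch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a full PML request. Field names and sizes are
 * illustrative only and do not match the real Open MPI structures. */
typedef struct sketch_request {
    const void   *buf;
    size_t        len;
    int           dst;
    int           tag;
    int           completed;
    unsigned char async_state[448]; /* state only nonblocking requests need */
} sketch_request_t;

/* Compile-time switch (assumption: the mode is known when compiling).
 * Because it expands to a constant, "if (SKETCH_DIRECT_MODE)" can be
 * folded by the compiler and the dead branch removed. */
#ifndef SKETCH_DIRECT_MODE
#define SKETCH_DIRECT_MODE 1
#endif

/* Counts bytes written during request setup, purely for illustration. */
static size_t sketch_bytes_initialized;

/* Full setup: every field of the request is written. */
static void sketch_full_init(sketch_request_t *req, const void *buf,
                             size_t len, int dst, int tag)
{
    memset(req, 0, sizeof(*req));
    req->buf = buf;
    req->len = len;
    req->dst = dst;
    req->tag = tag;
    sketch_bytes_initialized += sizeof(*req);
}

/* Minimal setup: only what a blocking call needs. dst and tag are not
 * stored because they are passed to the transport call directly. */
static void sketch_minimal_init(sketch_request_t *req, const void *buf,
                                size_t len)
{
    req->buf = buf;
    req->len = len;
    req->completed = 0;
    sketch_bytes_initialized +=
        sizeof(req->buf) + sizeof(req->len) + sizeof(req->completed);
}

/* Stub transport: marks the request complete and returns success. */
static int sketch_transport_send(sketch_request_t *req, int dst, int tag)
{
    (void)dst;
    (void)tag;
    req->completed = 1;
    return 0;
}

/* Blocking send. In direct mode a single static request is reused, which
 * is safe here only because the call returns after the request completes
 * and this sketch assumes one caller at a time. */
int sketch_blocking_send(const void *buf, size_t len, int dst, int tag)
{
    assert(buf != NULL);
    if (SKETCH_DIRECT_MODE) {
        static sketch_request_t direct_req;
        sketch_minimal_init(&direct_req, buf, len);
        return sketch_transport_send(&direct_req, dst, tag);
    } else {
        sketch_request_t req;
        sketch_full_init(&req, buf, len, dst, tag);
        return sketch_transport_send(&req, dst, tag);
    }
}
```

Calling sketch_blocking_send() with an 8-byte payload in direct mode touches only the pointer, length, and completion flag, instead of the whole structure. Reusing one static request is the part that depends on the call being blocking; the sketch makes no attempt to handle concurrent callers.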