shamisp commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1076591273


   > Alright, I tried setting up UCX over EFA's UD support. Unfortunately, it doesn't seem to work:
   > 
   > ```
   > # Memory domain: rdmap0s6
   > #     Component: ib
   > #             register: unlimited, cost: 180 nsec
   > #           remote key: 8 bytes
   > #           local memory handle is required for zcopy
   > #
   > #      Transport: ud_verbs
   > #         Device: rdmap0s6:1
   > #           Type: network
   > #  System device: rdmap0s6 (1)
   > [1648043869.866926] [ip-172-31-42-103:2699 :0]        ib_iface.c:1035 UCX  ERROR ibv_create_cq(cqe=1024) failed: Operation not supported
   > #   < failed to open interface >
   > # < failed to open connection manager rdmacm >
   > ```
   > 
   > fi_pingpong does work over EFA. System details:
   > 
   > ```
   > $ ucx_info -v
   > # UCT version=1.12.0 revision d367332
   > # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud
   > $ fi_info -p efa -t FI_EP_RDM
   > provider: efa
   >     fabric: EFA-fe80::8c4:1ff:fe4e:d730
   >     domain: rdmap0s6-rdm
   >     version: 114.0
   >     type: FI_EP_RDM
   >     protocol: FI_PROTO_EFA
   > $ lsmod | grep '\(^ib\|^rdma\)'
   > ib_iser                45056  0
   > rdma_cm               114688  1 ib_iser
   > ib_cm                 122880  1 rdma_cm
   > ib_uverbs             159744  1 efa
   > ib_core               360448  7 rdma_cm,efa,iw_cm,ib_iser,ib_uverbs,ib_cm
   > ubuntu@ip-172-31-42-103:~$ ibv_devinfo
   > hca_id:    rdmap0s6
   >    transport:                      unspecified (4)
   >    fw_ver:                         0.0.0.0
   >    node_guid:                      0000:0000:0000:0000
   >    sys_image_guid:                 0000:0000:0000:0000
   >    vendor_id:                      0x1d0f
   >    vendor_part_id:                 61344
   >    hw_ver:                         0xEFA0
   >    phys_port_cnt:                  1
   >            port:   1
   >                    state:                  PORT_ACTIVE (4)
   >                    max_mtu:                4096 (5)
   >                    active_mtu:             4096 (5)
   >                    sm_lid:                 0
   >                    port_lid:               0
   >                    port_lmc:               0x01
   >                    link_layer:             Unspecified
   > ```
   > 
   > I suppose this is needed: [openucx/ucx#6353](https://github.com/openucx/ucx/pull/6353). However, it still doesn't seem to work:
   > 
   > ```
   > $ ./contrib/configure-release --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr
   > ...
   > checking infiniband/efadv.h usability... yes
   > checking infiniband/efadv.h presence... yes
   > checking for infiniband/efadv.h... yes
   > checking for efadv_query_device in -lefa... yes
   > checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
   > checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
   > checking whether efadv_create_qp_ex is declared... yes
   > ...
   > $ ucx_info -v
   > # UCT version=1.12.0 revision 12ca5ef
   > # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr
   > $ ucx_info -d
   > ...
   > # Memory domain: rdmap0s6
   > #     Component: ib
   > #             register: unlimited, cost: 180 nsec
   > #           remote key: 8 bytes
   > #           local memory handle is required for zcopy
   > #
   > #      Transport: ud_verbs
   > #         Device: rdmap0s6:1
   > #  System device: 0000:00:06.0 (0)
   > [1648045582.667238] [ip-172-31-42-103:154374:0]        ib_iface.c:1034 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   > #   < failed to open interface >
   > # < failed to open connection manager rdmacm >
   > [1648045154.652145] [ip-172-31-42-103:93167:0]        ib_iface.c:1034 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   > #   < failed to open interface >
   > # < failed to open connection manager rdmacm >
   > ```
   > 
   > I also tried the SRD support ([openucx/ucx#6636](https://github.com/openucx/ucx/pull/6636)), but that seems suspect too:
   > 
   > ```
   > $ ./contrib/configure-release --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr --with-srd
   > ...
   > checking infiniband/efadv.h usability... yes
   > checking infiniband/efadv.h presence... yes
   > checking for infiniband/efadv.h... yes
   > checking for efadv_query_device in -lefa... yes
   > checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
   > checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
   > checking whether efadv_create_qp_ex is declared... yes
   > $ ucx_info -v
   > # UCT version=1.12.0 revision 0426d4e
   > # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr --with-srd
   > $ ucx_info -d
   > # Memory domain: rdmap0s6
   > #     Component: ib
   > #             register: unlimited, cost: 180 nsec
   > #           remote key: 8 bytes
   > #           local memory handle is required for zcopy
   > #
   > #      Transport: ud_verbs
   > #         Device: rdmap0s6:1
   > #           Type: network
   > #  System device: rdmap0s6 (1)
   > [1648045656.572456] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   > #   < failed to open interface >
   > #
   > #      Transport: srd
   > #         Device: rdmap0s6:1
   > #           Type: network
   > #  System device: rdmap0s6 (1)
   > [1648045656.572538] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   > #   < failed to open interface >
   > # < failed to open connection manager rdmacm >
   > ...
   > ```
   
   AFAIK, support for EFA is still a work in progress, so you may run into some bugs there. @SeyedMir will know better.
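
For triaging `ucx_info -d` output like the logs quoted above, here is a minimal sketch that lists which transports failed to open and the `UCX ERROR` message logged just before each failure. It is not part of UCX: `failed_transports` and `SAMPLE` are hypothetical names, the sample text is copied from the SRD run above, and the parsing assumes the error line appears between the `Transport:` line and the `< failed to open interface >` marker, as it does in these logs.

```python
import re

# Excerpt copied from the SRD run quoted above (wrapped log lines rejoined).
SAMPLE = """\
#      Transport: ud_verbs
#         Device: rdmap0s6:1
#           Type: network
#  System device: rdmap0s6 (1)
[1648045656.572456] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
#   < failed to open interface >
#
#      Transport: srd
#         Device: rdmap0s6:1
#           Type: network
#  System device: rdmap0s6 (1)
[1648045656.572538] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  ERROR ibv_create_cq(cqe=256) failed: Operation not supported
#   < failed to open interface >
# < failed to open connection manager rdmacm >
"""


def failed_transports(ucx_info_output):
    """Return {transport: [error messages]} for transports that failed to open.

    Hypothetical helper: it only pattern-matches the text layout seen in the
    logs above and does not call into UCX.
    """
    failures = {}
    current = None   # transport section currently being read
    errors = []      # UCX ERROR messages seen in that section
    for raw in ucx_info_output.splitlines():
        # Tolerate quoted-reply prefixes such as "   > ".
        line = raw.strip().lstrip("> ").strip()
        match = re.match(r"#\s*Transport:\s*(\S+)", line)
        if match:
            current, errors = match.group(1), []
            continue
        match = re.search(r"UCX\s+ERROR\s+(.*)", line)
        if match and current is not None:
            errors.append(match.group(1))
            continue
        if "< failed to open interface >" in line and current is not None:
            failures[current] = errors
            current = None
    return failures


if __name__ == "__main__":
    for transport, messages in failed_transports(SAMPLE).items():
        print(transport, "->", "; ".join(messages) or "(no error logged)")
```

On this excerpt, both `ud_verbs` and `srd` are reported with the same `ibv_create_cq` failure, which matches what the raw logs show.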


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


