lidavidm commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1076443582


   Alright, I tried setting up UCX over EFA's UD support. Unfortunately, it 
doesn't seem to work:
   
   ```
   # Memory domain: rdmap0s6
   #     Component: ib
   #             register: unlimited, cost: 180 nsec
   #           remote key: 8 bytes
   #           local memory handle is required for zcopy
   #
   #      Transport: ud_verbs
   #         Device: rdmap0s6:1
   #           Type: network
   #  System device: rdmap0s6 (1)
   [1648043869.866926] [ip-172-31-42-103:2699 :0]        ib_iface.c:1035 UCX  
ERROR ibv_create_cq(cqe=1024) failed: Operation not supported
   #   < failed to open interface >
   # < failed to open connection manager rdmacm >
   ```
   
   fi_pingpong does work over EFA. System details:
   
   ```
   $ ucx_info -v
   # UCT version=1.12.0 revision d367332
   # configured with: --disable-logging --disable-debug --disable-assertions 
--disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt 
--enable-mt --with-avx --with-sse42 --with-mcpu --with-march 
--with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud
   $ fi_info -p efa -t FI_EP_RDM
   provider: efa
       fabric: EFA-fe80::8c4:1ff:fe4e:d730
       domain: rdmap0s6-rdm
       version: 114.0
       type: FI_EP_RDM
       protocol: FI_PROTO_EFA
   $ lsmod | grep '\(^ib\|^rdma\)'
   ib_iser                45056  0
   rdma_cm               114688  1 ib_iser
   ib_cm                 122880  1 rdma_cm
   ib_uverbs             159744  1 efa
   ib_core               360448  7 rdma_cm,efa,iw_cm,ib_iser,ib_uverbs,ib_cm
   $ ibv_devinfo
   hca_id:      rdmap0s6
        transport:                      unspecified (4)
        fw_ver:                         0.0.0.0
        node_guid:                      0000:0000:0000:0000
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x1d0f
        vendor_part_id:                 61344
        hw_ver:                         0xEFA0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x01
                        link_layer:             Unspecified
   ```
   
   I suppose this PR is needed (https://github.com/openucx/ucx/pull/6353). 
However, it still doesn't seem to work:
   
   ```
   $ ./contrib/configure-release --prefix=/home/ubuntu/prefix/ 
--enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu 
--with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa 
--with-ud --with-efa-dv=/usr
   ...
   checking infiniband/efadv.h usability... yes
   checking infiniband/efadv.h presence... yes
   checking for infiniband/efadv.h... yes
   checking for efadv_query_device in -lefa... yes
   checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
   checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
   checking whether efadv_create_qp_ex is declared... yes
   ...
   $ ucx_info -v
   # UCT version=1.12.0 revision 12ca5ef
   # configured with: --disable-logging --disable-debug --disable-assertions 
--disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt 
--enable-mt --with-avx --with-sse42 --with-mcpu --with-march 
--with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud 
--with-efa-dv=/usr
   $ ucx_info -d
   ...
   # Memory domain: rdmap0s6
   #     Component: ib
   #             register: unlimited, cost: 180 nsec
   #           remote key: 8 bytes
   #           local memory handle is required for zcopy
   #
   #      Transport: ud_verbs
   #         Device: rdmap0s6:1
   #  System device: 0000:00:06.0 (0)
   [1648045582.667238] [ip-172-31-42-103:154374:0]        ib_iface.c:1034 UCX  
ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   #   < failed to open interface >
   # < failed to open connection manager rdmacm >
   [1648045154.652145] [ip-172-31-42-103:93167:0]        ib_iface.c:1034 UCX  
ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   #   < failed to open interface >
   # < failed to open connection manager rdmacm >
   ```
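
To compare builds quickly, the failing transports can be pulled out of `ucx_info -d` output with a small script. This is only a sketch: it assumes the output keeps the `Transport:` and `< failed to open interface >` line shapes shown in the logs here, and `failed_transports` is a hypothetical helper name, not part of UCX.

```python
import re


def failed_transports(ucx_info_output):
    """Return transport names whose section reports a failed interface open.

    Assumes the line format seen in the `ucx_info -d` logs in this comment;
    this is not an official UCX output contract.
    """
    failed = []
    current = None
    for line in ucx_info_output.splitlines():
        match = re.match(r"\s*#\s+Transport:\s+(\S+)", line)
        if match:
            # Start of a new transport section.
            current = match.group(1)
        elif "failed to open interface" in line and current is not None:
            failed.append(current)
            current = None
    return failed


# Sample trimmed from the SRD build output in this comment.
sample = """\
#      Transport: ud_verbs
#         Device: rdmap0s6:1
#   < failed to open interface >
#
#      Transport: srd
#         Device: rdmap0s6:1
#   < failed to open interface >
# < failed to open connection manager rdmacm >
"""

print(failed_transports(sample))
```

Failure markers that appear without a preceding `Transport:` line are skipped, since there is no transport name to attribute them to.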
   
   I also tried the SRD support (https://github.com/openucx/ucx/pull/6636), but 
that seems suspect too:
   ```
   $ ./contrib/configure-release --prefix=/home/ubuntu/srd-prefix/ 
--enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu 
--with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa 
--with-ud --with-efa-dv=/usr --with-srd
   ...
   checking infiniband/efadv.h usability... yes
   checking infiniband/efadv.h presence... yes
   checking for infiniband/efadv.h... yes
   checking for efadv_query_device in -lefa... yes
   checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
   checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
   checking whether efadv_create_qp_ex is declared... yes
   $ ucx_info -v
   # UCT version=1.12.0 revision 0426d4e
   # configured with: --disable-logging --disable-debug --disable-assertions 
--disable-params-check --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt 
--enable-mt --with-avx --with-sse42 --with-mcpu --with-march 
--with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud 
--with-efa-dv=/usr --with-srd
   $ ucx_info -d
   # Memory domain: rdmap0s6
   #     Component: ib
   #             register: unlimited, cost: 180 nsec
   #           remote key: 8 bytes
   #           local memory handle is required for zcopy
   #
   #      Transport: ud_verbs
   #         Device: rdmap0s6:1
   #           Type: network
   #  System device: rdmap0s6 (1)
   [1648045656.572456] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  
ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   #   < failed to open interface >
   #
   #      Transport: srd
   #         Device: rdmap0s6:1
   #           Type: network
   #  System device: rdmap0s6 (1)
   [1648045656.572538] [ip-172-31-42-103:158057:0]        ib_iface.c:1029 UCX  
ERROR ibv_create_cq(cqe=256) failed: Operation not supported
   #   < failed to open interface >
   # < failed to open connection manager rdmacm >
   ...
   ```

