shamisp commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1076591273

> Alright, I tried setting up UCX over EFA's UD support. Unfortunately it doesn't seem to work:
>
> ```
> # Memory domain: rdmap0s6
> # Component: ib
> # register: unlimited, cost: 180 nsec
> # remote key: 8 bytes
> # local memory handle is required for zcopy
> #
> # Transport: ud_verbs
> # Device: rdmap0s6:1
> # Type: network
> # System device: rdmap0s6 (1)
> [1648043869.866926] [ip-172-31-42-103:2699 :0] ib_iface.c:1035 UCX ERROR ibv_create_cq(cqe=1024) failed: Operation not supported
> # < failed to open interface >
> # < failed to open connection manager rdmacm >
> ```
>
> fi_pingpong does work over EFA. System details:
>
> ```
> $ ucx_info -v
> # UCT version=1.12.0 revision d367332
> # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud
> $ fi_info -p efa -t FI_EP_RDM
> provider: efa
> fabric: EFA-fe80::8c4:1ff:fe4e:d730
> domain: rdmap0s6-rdm
> version: 114.0
> type: FI_EP_RDM
> protocol: FI_PROTO_EFA
> $ lsmod | grep '\(^ib\|^rdma\)'
> ib_iser 45056 0
> rdma_cm 114688 1 ib_iser
> ib_cm 122880 1 rdma_cm
> ib_uverbs 159744 1 efa
> ib_core 360448 7 rdma_cm,efa,iw_cm,ib_iser,ib_uverbs,ib_cm
> ubuntu@ip-172-31-42-103:~$ ibv_devinfo
> hca_id: rdmap0s6
> transport: unspecified (4)
> fw_ver: 0.0.0.0
> node_guid: 0000:0000:0000:0000
> sys_image_guid: 0000:0000:0000:0000
> vendor_id: 0x1d0f
> vendor_part_id: 61344
> hw_ver: 0xEFA0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x01
> link_layer: Unspecified
> ```
>
> I suppose this is needed: [openucx/ucx#6353](https://github.com/openucx/ucx/pull/6353) However it still doesn't seem to work:
>
> ```
> $ ./contrib/configure-release --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr
> ...
> checking infiniband/efadv.h usability... yes
> checking infiniband/efadv.h presence... yes
> checking for infiniband/efadv.h... yes
> checking for efadv_query_device in -lefa... yes
> checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
> checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
> checking whether efadv_create_qp_ex is declared... yes
> ...
> $ ucx_info -v
> # UCT version=1.12.0 revision 12ca5ef
> # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr
> $ ucx_info -d
> ...
> # Memory domain: rdmap0s6
> # Component: ib
> # register: unlimited, cost: 180 nsec
> # remote key: 8 bytes
> # local memory handle is required for zcopy
> #
> # Transport: ud_verbs
> # Device: rdmap0s6:1
> # System device: 0000:00:06.0 (0)
> [1648045582.667238] [ip-172-31-42-103:154374:0] ib_iface.c:1034 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
> # < failed to open interface >
> # < failed to open connection manager rdmacm >
> [1648045154.652145] [ip-172-31-42-103:93167:0] ib_iface.c:1034 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
> # < failed to open interface >
> # < failed to open connection manager rdmacm >
> ```
>
> I also tried the SRD support ([openucx/ucx#6636](https://github.com/openucx/ucx/pull/6636)), but that seems suspect too:
>
> ```
> $ ./contrib/configure-release --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr --with-srd
> ...
> checking infiniband/efadv.h usability... yes
> checking infiniband/efadv.h presence... yes
> checking for infiniband/efadv.h... yes
> checking for efadv_query_device in -lefa... yes
> checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
> checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
> checking whether efadv_create_qp_ex is declared... yes
> $ ucx_info -v
> # UCT version=1.12.0 revision 0426d4e
> # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr --with-srd
> $ ucx_info -d
> # Memory domain: rdmap0s6
> # Component: ib
> # register: unlimited, cost: 180 nsec
> # remote key: 8 bytes
> # local memory handle is required for zcopy
> #
> # Transport: ud_verbs
> # Device: rdmap0s6:1
> # Type: network
> # System device: rdmap0s6 (1)
> [1648045656.572456] [ip-172-31-42-103:158057:0] ib_iface.c:1029 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
> # < failed to open interface >
> #
> # Transport: srd
> # Device: rdmap0s6:1
> # Type: network
> # System device: rdmap0s6 (1)
> [1648045656.572538] [ip-172-31-42-103:158057:0] ib_iface.c:1029 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
> # < failed to open interface >
> # < failed to open connection manager rdmacm >
> ...
> ```

AFAIK support for EFA is still a work in progress, and you may run into some bugs there. @SeyedMir will know better.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
