lidavidm commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1076443582
Alright, I tried setting up UCX over EFA's UD support. Unfortunately it
doesn't seem to work:
```
# Memory domain: rdmap0s6
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: ud_verbs
# Device: rdmap0s6:1
# Type: network
# System device: rdmap0s6 (1)
[1648043869.866926] [ip-172-31-42-103:2699 :0] ib_iface.c:1035 UCX ERROR ibv_create_cq(cqe=1024) failed: Operation not supported
# < failed to open interface >
# < failed to open connection manager rdmacm >
```
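For comparing builds side by side, here is a minimal sketch that pulls the per-transport status out of `ucx_info -d` output like the above. It is not part of UCX: `summarize_transports` is a hypothetical helper, and it assumes the output keeps the layout shown here (`# Transport:` / `# Device:` lines, followed by an `ERROR` log line and `< failed to open interface >` on failure).

```python
import re


def summarize_transports(text):
    """Summarize transports found in `ucx_info -d` style output.

    Returns one dict per "# Transport:" line with keys
    transport, device, status ("ok" or "failed"), and error.
    Assumes the layout seen in the logs in this comment.
    """
    results = []
    current = None      # entry for the transport currently being read
    last_error = None   # most recent ERROR message not yet attributed

    for line in text.splitlines():
        stripped = line.strip()

        m = re.match(r"#\s*Transport:\s*(\S+)", stripped)
        if m:
            current = {"transport": m.group(1), "device": None,
                       "status": "ok", "error": None}
            results.append(current)
            continue

        # Matches "# Device: ..." but not "# System device: ...".
        m = re.match(r"#\s*Device:\s*(\S+)", stripped)
        if m and current is not None:
            current["device"] = m.group(1)
            continue

        # Log lines are not prefixed with '#'; also handles the case where
        # the mail client wrapped the line just before "ERROR".
        m = re.search(r"\bERROR\s+(.*)", stripped)
        if m and not stripped.startswith("#"):
            last_error = m.group(1)
            continue

        if "failed to open interface" in stripped and current is not None:
            current["status"] = "failed"
            current["error"] = last_error
            last_error = None

    return results


SAMPLE = """\
#      Transport: ud_verbs
#         Device: rdmap0s6:1
#           Type: network
#  System device: rdmap0s6 (1)
[1648043869.866926] [ip-172-31-42-103:2699 :0] ib_iface.c:1035 UCX ERROR ibv_create_cq(cqe=1024) failed: Operation not supported
#   < failed to open interface >
"""

if __name__ == "__main__":
    for entry in summarize_transports(SAMPLE):
        print(entry["transport"], entry["device"], entry["status"], entry["error"])
```

Piping the real output in (for example via `ucx_info -d 2>&1`) and calling `summarize_transports` on it should give one entry per transport, which makes it easier to see whether a rebuild changed anything.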
fi_pingpong does work over EFA. System details:
```
$ ucx_info -v
# UCT version=1.12.0 revision d367332
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud
$ fi_info -p efa -t FI_EP_RDM
provider: efa
fabric: EFA-fe80::8c4:1ff:fe4e:d730
domain: rdmap0s6-rdm
version: 114.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
$ lsmod | grep '\(^ib\|^rdma\)'
ib_iser 45056 0
rdma_cm 114688 1 ib_iser
ib_cm 122880 1 rdma_cm
ib_uverbs 159744 1 efa
ib_core 360448 7 rdma_cm,efa,iw_cm,ib_iser,ib_uverbs,ib_cm
$ ibv_devinfo
hca_id: rdmap0s6
transport: unspecified (4)
fw_ver: 0.0.0.0
node_guid: 0000:0000:0000:0000
sys_image_guid: 0000:0000:0000:0000
vendor_id: 0x1d0f
vendor_part_id: 61344
hw_ver: 0xEFA0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x01
link_layer: Unspecified
```
I suppose https://github.com/openucx/ucx/pull/6353 is needed. However, with
that patch built in, it still doesn't seem to work:
```
$ ./contrib/configure-release --prefix=/home/ubuntu/prefix/ \
    --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu \
    --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa \
    --with-ud --with-efa-dv=/usr
...
checking infiniband/efadv.h usability... yes
checking infiniband/efadv.h presence... yes
checking for infiniband/efadv.h... yes
checking for efadv_query_device in -lefa... yes
checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
checking whether efadv_create_qp_ex is declared... yes
...
$ ucx_info -v
# UCT version=1.12.0 revision 12ca5ef
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr
$ ucx_info -d
...
# Memory domain: rdmap0s6
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: ud_verbs
# Device: rdmap0s6:1
# System device: 0000:00:06.0 (0)
[1648045582.667238] [ip-172-31-42-103:154374:0] ib_iface.c:1034 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
# < failed to open interface >
# < failed to open connection manager rdmacm >
[1648045154.652145] [ip-172-31-42-103:93167:0] ib_iface.c:1034 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
# < failed to open interface >
# < failed to open connection manager rdmacm >
```
I also tried the SRD support (https://github.com/openucx/ucx/pull/6636), but
both ud_verbs and srd fail with the same ibv_create_cq error:
```
$ ./contrib/configure-release --prefix=/home/ubuntu/srd-prefix/ \
    --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu \
    --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa \
    --with-ud --with-efa-dv=/usr --with-srd
...
checking infiniband/efadv.h usability... yes
checking infiniband/efadv.h presence... yes
checking for infiniband/efadv.h... yes
checking for efadv_query_device in -lefa... yes
checking whether EFADV_DEVICE_ATTR_CAPS_RDMA_READ is declared... yes
checking whether IBV_QP_INIT_ATTR_SEND_OPS_FLAGS is declared... yes
checking whether efadv_create_qp_ex is declared... yes
$ ucx_info -v
# UCT version=1.12.0 revision 0426d4e
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/srd-prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/ --with-verbs=/opt/amazon/efa --with-ud --with-efa-dv=/usr --with-srd
$ ucx_info -d
# Memory domain: rdmap0s6
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: ud_verbs
# Device: rdmap0s6:1
# Type: network
# System device: rdmap0s6 (1)
[1648045656.572456] [ip-172-31-42-103:158057:0] ib_iface.c:1029 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
# < failed to open interface >
#
# Transport: srd
# Device: rdmap0s6:1
# Type: network
# System device: rdmap0s6 (1)
[1648045656.572538] [ip-172-31-42-103:158057:0] ib_iface.c:1029 UCX ERROR ibv_create_cq(cqe=256) failed: Operation not supported
# < failed to open interface >
# < failed to open connection manager rdmacm >
...
```