lidavidm commented on pull request #12442: URL: https://github.com/apache/arrow/pull/12442#issuecomment-1073859634
I did a test on Amazon EC2 over TCP. I'll have to try again with UD/EFA when I get a chance.

Setup:

- Two g4dn.8xlarge instances (guaranteed 50 Gbps network throughput) in a cluster placement group
- Ubuntu 20.04, GCC 9.4.0, CUDA 11.6

<details><summary><tt>ucx_info</tt> output</summary>

```
$ ucx_info -v
# UCT version=1.12.0 revision d367332
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/
$ ucx_info -d
#
# Memory domain: posix
#     Component: posix
#         allocate: <= 65203328K
#         remote key: 24 bytes
#         rkey_ptr is supported
#
#     Transport: posix
#         Device: memory
#         Type: intra-node
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 0.00/ppn + 12179.00 MB/sec
#         latency: 80 nsec
#         overhead: 10 nsec
#         put_short: <= 4294967295
#         put_bcopy: unlimited
#         get_bcopy: unlimited
#         am_short: <= 100
#         am_bcopy: <= 8256
#         domain: cpu
#         atomic_add: 32, 64 bit
#         atomic_and: 32, 64 bit
#         atomic_or: 32, 64 bit
#         atomic_xor: 32, 64 bit
#         atomic_fadd: 32, 64 bit
#         atomic_fand: 32, 64 bit
#         atomic_for: 32, 64 bit
#         atomic_fxor: 32, 64 bit
#         atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 8 bytes
#         iface address: 8 bytes
#         error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#         allocate: unlimited
#         remote key: 12 bytes
#         rkey_ptr is supported
#
#     Transport: sysv
#         Device: memory
#         Type: intra-node
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 0.00/ppn + 12179.00 MB/sec
#         latency: 80 nsec
#         overhead: 10 nsec
#         put_short: <= 4294967295
#         put_bcopy: unlimited
#         get_bcopy: unlimited
#         am_short: <= 100
#         am_bcopy: <= 8256
#         domain: cpu
#         atomic_add: 32, 64 bit
#         atomic_and: 32, 64 bit
#         atomic_or: 32, 64 bit
#         atomic_xor: 32, 64 bit
#         atomic_fadd: 32, 64 bit
#         atomic_fand: 32, 64 bit
#         atomic_for: 32, 64 bit
#         atomic_fxor: 32, 64 bit
#         atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 8 bytes
#         iface address: 8 bytes
#         error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#         register: unlimited, cost: 0 nsec
#         remote key: 0 bytes
#
#     Transport: self
#         Device: memory0
#         Type: loopback
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 0.00/ppn + 6911.00 MB/sec
#         latency: 0 nsec
#         overhead: 10 nsec
#         put_short: <= 4294967295
#         put_bcopy: unlimited
#         get_bcopy: unlimited
#         am_short: <= 8K
#         am_bcopy: <= 8K
#         domain: cpu
#         atomic_add: 32, 64 bit
#         atomic_and: 32, 64 bit
#         atomic_or: 32, 64 bit
#         atomic_xor: 32, 64 bit
#         atomic_fadd: 32, 64 bit
#         atomic_fand: 32, 64 bit
#         atomic_for: 32, 64 bit
#         atomic_fxor: 32, 64 bit
#         atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 0 bytes
#         iface address: 8 bytes
#         error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#         register: unlimited, cost: 0 nsec
#         remote key: 0 bytes
#
#     Transport: tcp
#         Device: ens5
#         Type: network
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 11.82/ppn + 0.00 MB/sec
#         latency: 10960 nsec
#         overhead: 50000 nsec
#         put_zcopy: <= 18446744073709551590, up to 6 iov
#         put_opt_zcopy_align: <= 1
#         put_align_mtu: <= 0
#         am_short: <= 8K
#         am_bcopy: <= 8K
#         am_zcopy: <= 64K, up to 6 iov
#         am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#         am header: <= 8037
#         connection: to ep, to iface
#         device priority: 0
#         device num paths: 1
#         max eps: 256
#         device address: 6 bytes
#         iface address: 2 bytes
#         ep address: 10 bytes
#         error handling: peer failure, ep_check, keepalive
#
#     Transport: tcp
#         Device: lo
#         Type: network
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 11.91/ppn + 0.00 MB/sec
#         latency: 10960 nsec
#         overhead: 50000 nsec
#         put_zcopy: <= 18446744073709551590, up to 6 iov
#         put_opt_zcopy_align: <= 1
#         put_align_mtu: <= 0
#         am_short: <= 8K
#         am_bcopy: <= 8K
#         am_zcopy: <= 64K, up to 6 iov
#         am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#         am header: <= 8037
#         connection: to ep, to iface
#         device priority: 1
#         device num paths: 1
#         max eps: 256
#         device address: 18 bytes
#         iface address: 2 bytes
#         ep address: 10 bytes
#         error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#     max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#         allocate: unlimited
#         register: unlimited, cost: 0 nsec
#
#     Transport: cuda_copy
#         Device: cuda
#         Type: accelerator
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 10000.00/ppn + 0.00 MB/sec
#         latency: 8000 nsec
#         overhead: 0 nsec
#         put_short: <= 4294967295
#         put_zcopy: unlimited, up to 1 iov
#         put_opt_zcopy_align: <= 1
#         put_align_mtu: <= 1
#         get_short: <= 4294967295
#         get_zcopy: unlimited, up to 1 iov
#         get_opt_zcopy_align: <= 1
#         get_align_mtu: <= 1
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 0 bytes
#         iface address: 8 bytes
#         error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#         register: unlimited, cost: 0 nsec
#         remote key: 112 bytes
#
#     Transport: cuda_ipc
#         Device: cuda
#         Type: intra-node
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 250000.00/ppn + 0.00 MB/sec
#         latency: 1 nsec
#         overhead: 0 nsec
#         put_zcopy: unlimited, up to 1 iov
#         put_opt_zcopy_align: <= 1
#         put_align_mtu: <= 1
#         get_zcopy: <= 0, up to 1 iov
#         get_opt_zcopy_align: <= 1
#         get_align_mtu: <= 1
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 8 bytes
#         iface address: 4 bytes
#         error handling: peer failure, ep_check
#
#
# Memory domain: cma
#     Component: cma
#         register: unlimited, cost: 9 nsec
#
#     Transport: cma
#         Device: memory
#         Type: intra-node
#         System device: <unknown>
#
#     capabilities:
#         bandwidth: 0.00/ppn + 11145.00 MB/sec
#         latency: 80 nsec
#         overhead: 2000 nsec
#         put_zcopy: unlimited, up to 16 iov
#         put_opt_zcopy_align: <= 1
#         put_align_mtu: <= 1
#         get_zcopy: unlimited, up to 16 iov
#         get_opt_zcopy_align: <= 1
#         get_align_mtu: <= 1
#         connection: to iface
#         device priority: 0
#         device num paths: 1
#         max eps: inf
#         device address: 8 bytes
#         iface address: 4 bytes
#         error handling: peer failure, ep_check
```
</details>

<details><summary>iperf3 baseline</summary>

```
# One stream
$ iperf3 -c ip-172-31-36-32 -p 8008 -f G -Z
Connecting to host ip-172-31-36-32, port 8008
[  5] local 172.31.43.230 port 49538 connected to 172.31.36.32 port 8008
[ ID] Interval           Transfer     Bitrate          Retr  Cwnd
[  5]   0.00-1.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
[  5]   1.00-2.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
[  5]   2.00-3.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
[  5]   3.00-4.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
[  5]   4.00-5.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
[  5]   5.00-6.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
[  5]   6.00-7.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
[  5]   7.00-8.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
[  5]   8.00-9.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
[  5]   9.00-10.00  sec  1.11 GBytes  1.11 GBytes/sec    0   2.34 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate          Retr
[  5]   0.00-10.00  sec  11.1 GBytes  1.11 GBytes/sec    0          sender
[  5]   0.00-10.00  sec  11.1 GBytes  1.11 GBytes/sec               receiver

# Multi stream
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[  5]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[  7]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[  7]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[  9]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[  9]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[ 11]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[ 11]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[ 13]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[ 13]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[ 15]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[ 15]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[ 17]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[ 17]   0.00-10.00  sec  4.96 GBytes  4.26 Gbits/sec               receiver
[ 19]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0          sender
[ 19]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec               receiver
[SUM]   0.00-10.00  sec  39.7 GBytes  34.1 Gbits/sec    0          sender
[SUM]   0.00-10.00  sec  39.7 GBytes  34.1 Gbits/sec               receiver
```
</details>

## Results

### Local/loopback

128 KB batch size:

| # concurrent streams | gRPC (MB/s) | UCX (MB/s) |
| :------------------- | ----------: | ---------: |
| 1                    |      2693.7 |     6503.4 |
| 2                    |      5265.4 |    12559.0 |
| 4                    |     10092.7 |    (error) |
| 8                    |     16470.0 |    (error) |
| 16                   |     21834.6 |    (error) |

2 MB batch size:

| # concurrent streams | gRPC (MB/s) | UCX (MB/s) |
| :------------------- | ----------: | ---------: |
| 1                    |      2766.9 |     7684.0 |
| 2                    |      4473.6 |    13657.4 |
| 4                    |      7347.5 |    (error) |
| 8                    |     12097.5 |    (error) |
| 16                   |     14951.0 |    (error) |

The UCX transport got stuck at or above 3 concurrent streams. I can reproduce this locally (though less frequently) and will investigate.
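For context, the MB/s columns are just wall-clock throughput: bytes moved divided by elapsed time, in MiB/s. A minimal stdlib sketch of a loopback measurement in that style is below; the socket sender/receiver and the batch count are hypothetical stand-ins for the actual Flight benchmark, not the code that produced these numbers.

```python
import socket
import threading
import time

BATCH_SIZE = 128 * 1024   # mirrors the 128 KB batch size in the tables above
NUM_BATCHES = 256         # hypothetical count; the real benchmark streams Arrow record batches

def _sender(conn):
    # Stream NUM_BATCHES fixed-size payloads, then close to signal EOF.
    payload = b"\x00" * BATCH_SIZE
    for _ in range(NUM_BATCHES):
        conn.sendall(payload)
    conn.close()

def run_loopback_benchmark():
    # Loopback "server": listen on an ephemeral port on 127.0.0.1.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.socket()
    client.connect(("127.0.0.1", port))
    conn, _ = listener.accept()
    threading.Thread(target=_sender, args=(conn,), daemon=True).start()

    # Throughput = bytes moved / wall-clock time, reported in MB/s (MiB/s).
    received = 0
    start = time.perf_counter()
    while True:
        chunk = client.recv(1 << 16)
        if not chunk:
            break
        received += len(chunk)
    elapsed = time.perf_counter() - start
    client.close()
    listener.close()
    return received, received / elapsed / (1024 ** 2)

if __name__ == "__main__":
    nbytes, mbps = run_loopback_benchmark()
    print(f"moved {nbytes} bytes at {mbps:.1f} MB/s")
```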
### Remote/TCP

Theoretical max: 50 Gbit/s ≈ 5960 MB/s

128 KB batch size:

| # concurrent streams | gRPC (MB/s) | UCX (MB/s) |
| :------------------- | ----------: | ---------: |
| 1                    |      1129.4 |     1068.6 |
| 2                    |      2258.2 |     2129.0 |
| 4                    |      3896.2 |     3925.9 |
| 8                    |      4219.7 |     4164.3 |
| 16                   |      5299.3 |     4810.3 |

2 MB batch size:

| # concurrent streams | gRPC (MB/s) | UCX (MB/s) |
| :------------------- | ----------: | ---------: |
| 1                    |      1131.9 |     1125.0 |
| 2                    |      2265.3 |     2244.6 |
| 4                    |      3619.7 |     3216.5 |
| 8                    |      4852.4 |     3085.6 |
| 16                   |      5128.1 |  (omitted) |

Some thoughts:

- The Flight/UCX transport is implemented with one worker per stream; perhaps that leads to too much resource usage (and hence worse scaling) with many concurrent streams?
- The shared memory transport handily beats loopback; we just need to fix the bug that causes it to get stuck.
- Neither transport nor iperf3 seems able to achieve the full throughput.
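The "theoretical max" quoted above is a straight unit conversion; a quick sketch (the 50 Gbit/s figure comes out to ~5960 only if "MB/s" is read as MiB/s, which is the convention in these tables):

```python
# Convert the instance's advertised network bandwidth to the MB/s (MiB/s)
# used in the result tables.
link_gbits = 50                        # g4dn.8xlarge guaranteed network throughput
bytes_per_sec = link_gbits * 1e9 / 8   # 6.25e9 bytes/s
mib_per_sec = bytes_per_sec / (1024 ** 2)
print(round(mib_per_sec))              # -> 5960
```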
