lidavidm commented on pull request #12442:
URL: https://github.com/apache/arrow/pull/12442#issuecomment-1073859634


   I did a test on Amazon EC2 over TCP. I'll have to try again with UD/EFA when I get a chance.
   
   Setup:
   - g4dn.8xlarge, guaranteed 50 Gbps network throughput, two instances in a cluster placement group
   - Ubuntu 20.04
   - GCC 9.4.0
   - CUDA 11.6
   
   <details><summary><tt>ucx_info -v</tt> and <tt>ucx_info -d</tt></summary>
   
   ```
   $ ucx_info -v
   # UCT version=1.12.0 revision d367332
   # configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/ubuntu/prefix/ --enable-compiler-opt --enable-mt --with-avx --with-sse42 --with-mcpu --with-march --with-cuda=/usr/local/cuda-11.6/
   $ ucx_info -d
   #
   # Memory domain: posix
   #     Component: posix
   #             allocate: <= 65203328K
   #           remote key: 24 bytes
   #           rkey_ptr is supported
   #
   #      Transport: posix
   #         Device: memory
   #           Type: intra-node
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 0.00/ppn + 12179.00 MB/sec
   #              latency: 80 nsec
   #             overhead: 10 nsec
   #            put_short: <= 4294967295
   #            put_bcopy: unlimited
   #            get_bcopy: unlimited
   #             am_short: <= 100
   #             am_bcopy: <= 8256
   #               domain: cpu
   #           atomic_add: 32, 64 bit
   #           atomic_and: 32, 64 bit
   #            atomic_or: 32, 64 bit
   #           atomic_xor: 32, 64 bit
   #          atomic_fadd: 32, 64 bit
   #          atomic_fand: 32, 64 bit
   #           atomic_for: 32, 64 bit
   #          atomic_fxor: 32, 64 bit
   #          atomic_swap: 32, 64 bit
   #         atomic_cswap: 32, 64 bit
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 8 bytes
   #        iface address: 8 bytes
   #       error handling: ep_check
   #
   #
   # Memory domain: sysv
   #     Component: sysv
   #             allocate: unlimited
   #           remote key: 12 bytes
   #           rkey_ptr is supported
   #
   #      Transport: sysv
   #         Device: memory
   #           Type: intra-node
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 0.00/ppn + 12179.00 MB/sec
   #              latency: 80 nsec
   #             overhead: 10 nsec
   #            put_short: <= 4294967295
   #            put_bcopy: unlimited
   #            get_bcopy: unlimited
   #             am_short: <= 100
   #             am_bcopy: <= 8256
   #               domain: cpu
   #           atomic_add: 32, 64 bit
   #           atomic_and: 32, 64 bit
   #            atomic_or: 32, 64 bit
   #           atomic_xor: 32, 64 bit
   #          atomic_fadd: 32, 64 bit
   #          atomic_fand: 32, 64 bit
   #           atomic_for: 32, 64 bit
   #          atomic_fxor: 32, 64 bit
   #          atomic_swap: 32, 64 bit
   #         atomic_cswap: 32, 64 bit
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 8 bytes
   #        iface address: 8 bytes
   #       error handling: ep_check
   #
   #
   # Memory domain: self
   #     Component: self
   #             register: unlimited, cost: 0 nsec
   #           remote key: 0 bytes
   #
   #      Transport: self
   #         Device: memory0
   #           Type: loopback
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 0.00/ppn + 6911.00 MB/sec
   #              latency: 0 nsec
   #             overhead: 10 nsec
   #            put_short: <= 4294967295
   #            put_bcopy: unlimited
   #            get_bcopy: unlimited
   #             am_short: <= 8K
   #             am_bcopy: <= 8K
   #               domain: cpu
   #           atomic_add: 32, 64 bit
   #           atomic_and: 32, 64 bit
   #            atomic_or: 32, 64 bit
   #           atomic_xor: 32, 64 bit
   #          atomic_fadd: 32, 64 bit
   #          atomic_fand: 32, 64 bit
   #           atomic_for: 32, 64 bit
   #          atomic_fxor: 32, 64 bit
   #          atomic_swap: 32, 64 bit
   #         atomic_cswap: 32, 64 bit
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 0 bytes
   #        iface address: 8 bytes
   #       error handling: ep_check
   #
   #
   # Memory domain: tcp
   #     Component: tcp
   #             register: unlimited, cost: 0 nsec
   #           remote key: 0 bytes
   #
   #      Transport: tcp
   #         Device: ens5
   #           Type: network
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 11.82/ppn + 0.00 MB/sec
   #              latency: 10960 nsec
   #             overhead: 50000 nsec
   #            put_zcopy: <= 18446744073709551590, up to 6 iov
   #  put_opt_zcopy_align: <= 1
   #        put_align_mtu: <= 0
   #             am_short: <= 8K
   #             am_bcopy: <= 8K
   #             am_zcopy: <= 64K, up to 6 iov
   #   am_opt_zcopy_align: <= 1
   #         am_align_mtu: <= 0
   #            am header: <= 8037
   #           connection: to ep, to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: 256
   #       device address: 6 bytes
   #        iface address: 2 bytes
   #           ep address: 10 bytes
   #       error handling: peer failure, ep_check, keepalive
   #
   #      Transport: tcp
   #         Device: lo
   #           Type: network
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 11.91/ppn + 0.00 MB/sec
   #              latency: 10960 nsec
   #             overhead: 50000 nsec
   #            put_zcopy: <= 18446744073709551590, up to 6 iov
   #  put_opt_zcopy_align: <= 1
   #        put_align_mtu: <= 0
   #             am_short: <= 8K
   #             am_bcopy: <= 8K
   #             am_zcopy: <= 64K, up to 6 iov
   #   am_opt_zcopy_align: <= 1
   #         am_align_mtu: <= 0
   #            am header: <= 8037
   #           connection: to ep, to iface
   #      device priority: 1
   #     device num paths: 1
   #              max eps: 256
   #       device address: 18 bytes
   #        iface address: 2 bytes
   #           ep address: 10 bytes
   #       error handling: peer failure, ep_check, keepalive
   #
   #
   # Connection manager: tcp
   #      max_conn_priv: 2064 bytes
   #
   # Memory domain: cuda_cpy
   #     Component: cuda_cpy
   #             allocate: unlimited
   #             register: unlimited, cost: 0 nsec
   #
   #      Transport: cuda_copy
   #         Device: cuda
   #           Type: accelerator
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 10000.00/ppn + 0.00 MB/sec
   #              latency: 8000 nsec
   #             overhead: 0 nsec
   #            put_short: <= 4294967295
   #            put_zcopy: unlimited, up to 1 iov
   #  put_opt_zcopy_align: <= 1
   #        put_align_mtu: <= 1
   #            get_short: <= 4294967295
   #            get_zcopy: unlimited, up to 1 iov
   #  get_opt_zcopy_align: <= 1
   #        get_align_mtu: <= 1
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 0 bytes
   #        iface address: 8 bytes
   #       error handling: none
   #
   #
   # Memory domain: cuda_ipc
   #     Component: cuda_ipc
   #             register: unlimited, cost: 0 nsec
   #           remote key: 112 bytes
   #
   #      Transport: cuda_ipc
   #         Device: cuda
   #           Type: intra-node
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 250000.00/ppn + 0.00 MB/sec
   #              latency: 1 nsec
   #             overhead: 0 nsec
   #            put_zcopy: unlimited, up to 1 iov
   #  put_opt_zcopy_align: <= 1
   #        put_align_mtu: <= 1
   #            get_zcopy: <= 0, up to 1 iov
   #  get_opt_zcopy_align: <= 1
   #        get_align_mtu: <= 1
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 8 bytes
   #        iface address: 4 bytes
   #       error handling: peer failure, ep_check
   #
   #
   # Memory domain: cma
   #     Component: cma
   #             register: unlimited, cost: 9 nsec
   #
   #      Transport: cma
   #         Device: memory
   #           Type: intra-node
   #  System device: <unknown>
   #
   #      capabilities:
   #            bandwidth: 0.00/ppn + 11145.00 MB/sec
   #              latency: 80 nsec
   #             overhead: 2000 nsec
   #            put_zcopy: unlimited, up to 16 iov
   #  put_opt_zcopy_align: <= 1
   #        put_align_mtu: <= 1
   #            get_zcopy: unlimited, up to 16 iov
   #  get_opt_zcopy_align: <= 1
   #        get_align_mtu: <= 1
   #           connection: to iface
   #      device priority: 0
   #     device num paths: 1
   #              max eps: inf
   #       device address: 8 bytes
   #        iface address: 4 bytes
   #       error handling: peer failure, ep_check
   #
   ```
   
   </details>
   
   <details><summary>iperf3 baseline</summary>
   
   ```
   # One stream
   $ iperf3 -c ip-172-31-36-32 -p 8008 -f G -Z
   Connecting to host ip-172-31-36-32, port 8008
   [  5] local 172.31.43.230 port 49538 connected to 172.31.36.32 port 8008
   [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
   [  5]   0.00-1.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
   [  5]   1.00-2.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
   [  5]   2.00-3.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
   [  5]   3.00-4.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
   [  5]   4.00-5.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.35 MBytes
   [  5]   5.00-6.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
   [  5]   6.00-7.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
   [  5]   7.00-8.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
   [  5]   8.00-9.00   sec  1.11 GBytes  1.11 GBytes/sec    0   1.42 MBytes
   [  5]   9.00-10.00  sec  1.11 GBytes  1.11 GBytes/sec    0   2.34 MBytes
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bitrate         Retr
   [  5]   0.00-10.00  sec  11.1 GBytes  1.11 GBytes/sec    0             sender
   [  5]   0.00-10.00  sec  11.1 GBytes  1.11 GBytes/sec                  receiver
   
   # Multi stream
   [ ID] Interval           Transfer     Bitrate         Retr
   [  5]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [  5]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [  7]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [  7]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [  9]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [  9]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [ 11]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [ 11]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [ 13]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [ 13]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [ 15]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [ 15]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [ 17]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [ 17]   0.00-10.00  sec  4.96 GBytes  4.26 Gbits/sec                  receiver
   [ 19]   0.00-10.00  sec  4.97 GBytes  4.27 Gbits/sec    0             sender
   [ 19]   0.00-10.00  sec  4.97 GBytes  4.26 Gbits/sec                  receiver
   [SUM]   0.00-10.00  sec  39.7 GBytes  34.1 Gbits/sec    0             sender
   [SUM]   0.00-10.00  sec  39.7 GBytes  34.1 Gbits/sec                  receiver
   ```
   
   </details>
   
   ## Results
   
   ### Local/loopback
   
   128 KB batch size:
   
   | # concurrent streams | gRPC (MB/s)     | UCX (MB/s)      |
   | :------------------- | --------------: | --------------: |
   | 1                    | 2693.7          | 6503.4          |
   | 2                    | 5265.4          | 12559.0         |
   | 4                    | 10092.7         | (error)         |
   | 8                    | 16470.0         | (error)         |
   | 16                   | 21834.6         | (error)         |
   
   2 MB batch size:
   
   | # concurrent streams | gRPC (MB/s)     | UCX (MB/s)      |
   | :------------------- | --------------: | --------------: |
   | 1                    | 2766.9          | 7684.0          |
   | 2                    | 4473.6          | 13657.4         |
   | 4                    | 7347.5          | (error)         |
   | 8                    | 12097.5         | (error)         |
   | 16                   | 14951.0         | (error)         |
   
   The UCX transport got stuck with 3 or more concurrent streams, hence the (error) entries above.
   I can reproduce this locally (though less frequently) and will investigate.
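
To put the two working data points in perspective, here is a minimal sketch that computes the UCX-to-gRPC throughput ratio from the local/loopback tables. The numbers are copied from the tables; the names `LOCAL_RESULTS` and `speedups` are illustrative only and not part of the benchmark code.

```python
# Local/loopback throughput in MB/s, copied from the tables above.
# Only 1 and 2 streams are included because UCX errored at higher counts.
# Layout: batch size -> stream count -> (gRPC, UCX)
LOCAL_RESULTS = {
    "128 KB": {1: (2693.7, 6503.4), 2: (5265.4, 12559.0)},
    "2 MB": {1: (2766.9, 7684.0), 2: (4473.6, 13657.4)},
}


def speedups(results):
    """Return {(batch_size, streams): ucx_throughput / grpc_throughput}."""
    return {
        (batch, streams): ucx / grpc
        for batch, by_streams in results.items()
        for streams, (grpc, ucx) in by_streams.items()
    }


if __name__ == "__main__":
    for (batch, streams), ratio in speedups(LOCAL_RESULTS).items():
        print(f"{batch}, {streams} stream(s): UCX is {ratio:.2f}x gRPC")
```

On these four data points the ratio falls between roughly 2.4x and 3.1x, though with so few measurements this is only indicative.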
   
   ### Remote/TCP
   
   Theoretical max: 50Gbits/s ~= 5960 MBytes/s
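
For reference, the conversion behind that figure can be sketched as below. It assumes a decimal gigabit (10^9 bits) and a binary megabyte (2^20 bytes), which is the combination that reproduces 5960; the helper name `gbit_to_mib_per_s` is made up for illustration.

```python
def gbit_to_mib_per_s(gbit_per_s: float) -> float:
    """Convert decimal gigabits per second to binary megabytes (MiB) per second."""
    bytes_per_s = gbit_per_s * 1e9 / 8  # 1 Gbit = 10**9 bits, 8 bits per byte
    return bytes_per_s / 2**20          # 1 MiB = 2**20 bytes


if __name__ == "__main__":
    # Guaranteed instance bandwidth from the setup description.
    print(f"50 Gbit/s   = {gbit_to_mib_per_s(50):.0f} MiB/s")
    # Aggregate of the multi-stream iperf3 run from the baseline above.
    print(f"34.1 Gbit/s = {gbit_to_mib_per_s(34.1):.0f} MiB/s")
```

Under the same assumption, the multi-stream iperf3 aggregate of 34.1 Gbits/sec corresponds to about 4065 MB/s, or roughly 68% of the theoretical maximum.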
   
   128 KB batch size:
   
   | # concurrent streams | gRPC (MB/s)     | UCX (MB/s)      |
   | :------------------- | --------------: | --------------: |
   | 1                    | 1129.4          | 1068.6          |
   | 2                    | 2258.2          | 2129.0          |
   | 4                    | 3896.2          | 3925.9          |
   | 8                    | 4219.7          | 4164.3          |
   | 16                   | 5299.3          | 4810.3          |
   
   2 MB batch size:
   
   | # concurrent streams | gRPC (MB/s)     | UCX (MB/s)      |
   | :------------------- | --------------: | --------------: |
   | 1                    | 1131.9          | 1125.0          |
   | 2                    | 2265.3          | 2244.6          |
   | 4                    | 3619.7          | 3216.5          |
   | 8                    | 4852.4          | 3085.6          |
   | 16                   | 5128.1          | (omitted)       |
   
   Some thoughts:
   - The Flight/UCX transport is implemented with one worker per stream, but maybe that leads to too much resource usage (and hence worse scaling) with many concurrent streams?
   - The shared memory transport handily beats gRPC over loopback; we just need to fix the bug that causes it to get stuck.
   - Neither transport, nor iperf3, seems able to reach the full theoretical throughput.
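
One rough way to look at the scaling question is per-stream efficiency: measured throughput divided by what perfect linear scaling from the single-stream result would give. The sketch below applies that to the remote 128 KB table. The numbers are copied from the table; the names `REMOTE_128KB` and `scaling_efficiency` are hypothetical. Note that linear scaling to 16 streams would exceed the link's theoretical maximum, so efficiency has to drop at high stream counts regardless of the transport; this only helps compare the two, it does not identify a cause.

```python
STREAMS = [1, 2, 4, 8, 16]

# Remote/TCP throughput in MB/s at 128 KB batch size, copied from the table above.
REMOTE_128KB = {
    "gRPC": [1129.4, 2258.2, 3896.2, 4219.7, 5299.3],
    "UCX": [1068.6, 2129.0, 3925.9, 4164.3, 4810.3],
}


def scaling_efficiency(throughputs, streams):
    """Throughput relative to ideal linear scaling from the first measurement."""
    per_stream_base = throughputs[0] / streams[0]
    return [t / (n * per_stream_base) for t, n in zip(throughputs, streams)]


if __name__ == "__main__":
    for name, values in REMOTE_128KB.items():
        eff = scaling_efficiency(values, STREAMS)
        row = ", ".join(f"{n}: {e:.0%}" for n, e in zip(STREAMS, eff))
        print(f"{name:>4} efficiency by stream count -> {row}")
```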
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


