Hi Yibo,
Your discovery is impressive. Did you consider the `num_streams` parameter [1] as well? If I understood correctly, this parameter sets the number of conceptual concurrent streams between the client and the server, while `num_threads` sets the size of the thread pool that actually handles these streams [2]. By default, both parameters are 4.

As for CPU usage, the `records_per_batch` parameter [3] has an impact as well. If you increase its value, you will probably see that the data transfer speed increases while the server-side CPU usage drops [4]. My guess is that as more records are packed into one record batch, the total number of batches decreases. CPU is only spent (de)serializing the metadata (i.e. the schema) of each record batch, while the payload itself can be transferred at essentially zero cost [5].
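
To experiment with these knobs you can pass them straight to the benchmark binary, e.g. (just a sketch: the flag names below are the ones defined in flight_benchmark.cc [1][3], the values are only illustrative, and the binary name/path may differ depending on your build):

    ./arrow_flight_benchmark --num_streams=8 --num_threads=8 --records_per_batch=65536

For a rough sense of the effect: with a hypothetical 10 million records per stream, records_per_batch=4096 means ~2,441 batches whose metadata has to be (de)serialized, while 65536 means only ~153. Comparing two runs that differ only in records_per_batch should make the CPU difference easy to see.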

[1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC" in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

Kind Regards
Chengxin

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yibo....@arm.com> wrote:

> Found a way to achieve a reasonable benchmark result with multiple threads.
> Diff pasted below for a quick review or try.
> Tested on E5-2650, with this change:
> num_threads = 1, speed = 1996
> num_threads = 2, speed = 3555
> num_threads = 4, speed = 5828
>
> When running `arrow_flight_benchmark`, I find there's only one TCP connection
> between client and server, no matter what `num_threads` is. All clients share
> one TCP connection. At the server side, I see only one thread is processing
> network packets. On my machine, one client already saturates a CPU core, so
> it becomes worse when `num_threads` increases, as that single server thread
> becomes the bottleneck.
>
> If running in standalone mode, flight clients are from different processes
> and have their own TCP connections to the server. There are separate server
> threads handling network traffic for each connection, without a central
> bottleneck.
>
> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
> before giving up. Setting that arg makes each client establish its own TCP
> connection to the server, similar to standalone mode.
>
> Actually, I'm not quite sure if we should set this arg. Sharing one TCP
> connection is a reasonable configuration, and it's an advantage of gRPC [2].
>
> Per my test, most CPU cycles are spent in kernel mode doing networking and
> data transfer. Maybe a better solution is to leverage modern network
> techniques like RDMA or a user-mode stack for higher performance.
>
> [1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
> [2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page 5
>
> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
> index d530093d9..6904640d3 100644
> --- a/cpp/src/arrow/flight/client.cc
> +++ b/cpp/src/arrow/flight/client.cc
> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>    args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>    // Receive messages of any size
>    args.SetMaxReceiveMessageSize(-1);
>
> +  // Setting this arg enables each client to open it's own TCP connection to server,
> +  // not sharing one single connection, which becomes bottleneck under high load.
> +  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>
>    if (options.override_hostname != "") {
>      args.SetSslTargetNameOverride(options.override_hostname);
>
> On 6/15/20 10:00 PM, Wes McKinney wrote:
>
> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou anto...@python.org wrote:
> >
> > > Le 15/06/2020 à 15:36, Wes McKinney a écrit :
> > >
> > > > When you have only a single server, all the gRPC traffic goes through
> > > > a common port and is handled by a common server, so if both client and
> > > > server are roughly IO bound you aren't going to get better performance
> > > > by hitting the server with multiple clients simultaneously, only worse
> > > > because the packets from different client requests are intermingled in
> > > > the TCP traffic on that port. I'm not a networking expert but this is
> > > > my best understanding of what is going on.
> > >
> > > Yibo Cai's experiment disproves that explanation, though.
> > > When I run a single client against the test server, I get ~4 GB/s. When
> > > I run 6 standalone clients against the same test server, I get ~8 GB/s
> > > aggregate. So there's something else going on that limits scalability
> > > when the benchmark executable runs all clients by itself (perhaps gRPC
> > > clients in a single process share some underlying structure or execution
> > > threads? I don't know).
> >
> > I see, thanks. OK then clearly something else is going on.
> >
> > > > I hope someone will implement the "multiple test servers" TODO in the
> > > > benchmark.
> > >
> > > I think that's a bad idea in any case, as running multiple servers on
> > > different ports is not a realistic expectation from users.
> > >
> > > Regards
> > > Antoine.