I found a way to achieve reasonable benchmark results with multiple threads. Diff
is pasted below for a quick review or trial.
Tested on an E5-2650, with this change:
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828
When running `arrow_flight_benchmark`, I found there is only one TCP connection
between client and server, no matter what `num_threads` is: all clients share a
single TCP connection. On the server side, I see only one thread processing
network packets. On my machine, one client already saturates a CPU core, so
performance gets worse as `num_threads` increases, because that single server
thread becomes the bottleneck.
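To illustrate where the sharing comes from, here is a minimal raw-gRPC sketch
(not Arrow code; the target address and insecure credentials are placeholders
of mine): channels created in one process with identical arguments draw their
subchannels from a process-global pool, so they can end up multiplexed over the
same TCP connection.

// Sketch: two gRPC channels to the same target, created with identical
// arguments. By default their subchannels come from gRPC's process-global
// subchannel pool, so both channels may end up reusing one underlying TCP
// connection -- which is what the benchmark clients run into.
#include <grpcpp/grpcpp.h>

#include <memory>
#include <string>

int main() {
  const std::string target = "localhost:31337";  // placeholder address
  auto creds = grpc::InsecureChannelCredentials();

  // Distinct channel objects at the C++ level...
  std::shared_ptr<grpc::Channel> ch1 = grpc::CreateChannel(target, creds);
  std::shared_ptr<grpc::Channel> ch2 = grpc::CreateChannel(target, creds);

  // ...but the transport underneath may be shared, so RPCs issued on ch1
  // and ch2 can funnel through a single TCP connection to the server.
  (void)ch1;
  (void)ch2;
  return 0;
}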
When running in standalone mode, the flight clients live in different processes
and have their own TCP connections to the server. Separate server threads handle
the network traffic of each connection, so there is no central bottleneck.
I was lucky to find the channel arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
before giving up. Setting that arg makes each client establish its own TCP
connection to the server, similar to standalone mode.
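In raw gRPC terms, the fix boils down to the snippet below (a standalone
sketch; the helper name and target handling are mine, only the channel arg
itself comes from [1]):

// Sketch: opt a channel out of gRPC's global subchannel pool. With a local
// (per-channel) pool, subchannels are not shared across channels in this
// process, so each channel dials the server separately -- mirroring the
// standalone multi-process behaviour.
#include <grpcpp/grpcpp.h>

#include <memory>
#include <string>

std::shared_ptr<grpc::Channel> MakeIsolatedChannel(const std::string& target) {
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}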
That said, I'm not sure we should set this arg. Sharing one TCP connection is a
reasonable configuration, and it is an advantage of gRPC [2]. In my tests, most
CPU cycles are spent in kernel mode doing networking and data transfer; a
better solution might be to leverage modern networking techniques like RDMA or
a user-mode network stack for higher performance.
[1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
[2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5
diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
index d530093d9..6904640d3 100644
--- a/cpp/src/arrow/flight/client.cc
+++ b/cpp/src/arrow/flight/client.cc
@@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
// Receive messages of any size
args.SetMaxReceiveMessageSize(-1);
+ // Setting this arg enables each client to open its own TCP connection to the server,
+ // instead of sharing one single connection, which becomes a bottleneck under high load.
+ args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
if (options.override_hostname != "") {
args.SetSslTargetNameOverride(options.override_hostname);
On 6/15/20 10:00 PM, Wes McKinney wrote:
> On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou <anto...@python.org> wrote:
>> On 15/06/2020 at 15:36, Wes McKinney wrote:
>>> When you have only a single server, all the gRPC traffic goes through
>>> a common port and is handled by a common server, so if both client and
>>> server are roughly IO bound, you aren't going to get better performance
>>> by hitting the server with multiple clients simultaneously, only worse,
>>> because the packets from different client requests are intermingled in
>>> the TCP traffic on that port. I'm not a networking expert, but this is
>>> my best understanding of what is going on.
>>
>> Yibo Cai's experiment disproves that explanation, though.
>>
>> When I run a single client against the test server, I get ~4 GB/s. When
>> I run 6 standalone clients against the *same* test server, I get ~8 GB/s
>> aggregate. So there's something else going on that limits scalability
>> when the benchmark executable runs all clients by itself (perhaps gRPC
>> clients in a single process share some underlying structure or execution
>> threads? I don't know).
>
> I see, thanks. OK, then clearly something else is going on.
>
> I hope someone will implement the "multiple test servers" TODO in the
> benchmark.

I think that's a bad idea *in any case*, as running multiple servers on
different ports is not a realistic expectation from users.

Regards

Antoine.