I found a way to achieve reasonable benchmark results with multiple threads. 
The diff is pasted below for a quick review or try.
Tested on an E5-2650, with this change:
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828

When running `arrow_flight_benchmark`, I found there is only one TCP 
connection between client and server, no matter what `num_threads` is: all 
clients share one TCP connection. On the server side, I see only one thread 
processing network packets. On my machine, a single client already saturates 
a CPU core, so things get worse as `num_threads` increases, because that 
single server thread becomes the bottleneck.
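
To make the sharing concrete, here is a minimal sketch of what the 
benchmark's worker threads effectively do, written against the Flight C++ 
client API. This is an illustration rather than the actual benchmark code, 
and the host/port are placeholders:

#include <memory>
#include <vector>

#include "arrow/flight/api.h"
#include "arrow/status.h"

// Sketch only: one FlightClient per worker thread, all pointing at the same
// server ("localhost:31337" is a placeholder). gRPC maps channels with
// identical args to the same subchannel in its global pool, which is why
// these clients end up sharing a single TCP connection.
arrow::Status ConnectClients(int num_threads) {
  arrow::flight::Location location;
  ARROW_RETURN_NOT_OK(
      arrow::flight::Location::ForGrpcTcp("localhost", 31337, &location));
  std::vector<std::unique_ptr<arrow::flight::FlightClient>> clients(num_threads);
  for (auto& client : clients) {
    ARROW_RETURN_NOT_OK(arrow::flight::FlightClient::Connect(location, &client));
  }
  return arrow::Status::OK();
}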

When running in standalone mode, the flight clients are separate processes, 
each with its own TCP connection to the server. There are separate server 
threads handling the network traffic of each connection, so there is no 
central bottleneck.

I was lucky to find the channel arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] 
just before giving up. Setting that arg makes each client establish its own 
TCP connection to the server, similar to standalone mode.
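
For reference, here is a minimal gRPC-level sketch of the effect, outside of 
Flight; the target address is a placeholder:

#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Sketch only: with GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL set, each channel
// keeps a private subchannel pool, so channels with identical args no
// longer collapse onto one shared TCP connection.
std::shared_ptr<grpc::Channel> MakeIsolatedChannel(const std::string& target) {
  grpc::ChannelArguments args;
  // 1 enables the channel-local pool; the default (0) uses the global pool.
  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}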

Actually, I'm not quite sure whether we should set this arg. Sharing one TCP 
connection is a reasonable configuration, and it's one of the advantages of 
gRPC [2].

Per my test, most CPU cycles are spent in kernel mode doing networking and 
data transfer. Maybe a better solution is to leverage modern networking 
techniques like RDMA or a user-mode network stack for higher performance.

[1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
[2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5


diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
index d530093d9..6904640d3 100644
--- a/cpp/src/arrow/flight/client.cc
+++ b/cpp/src/arrow/flight/client.cc
@@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
     args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
     // Receive messages of any size
     args.SetMaxReceiveMessageSize(-1);
+    // Setting this arg enables each client to open its own TCP connection to
+    // the server, instead of sharing one single connection, which becomes a
+    // bottleneck under high load.
+    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
     if (options.override_hostname != "") {
       args.SetSslTargetNameOverride(options.override_hostname);


On 6/15/20 10:00 PM, Wes McKinney wrote:
> On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>> On 15/06/2020 at 15:36, Wes McKinney wrote:
>>
>>> When you have only a single server, all the gRPC traffic goes through
>>> a common port and is handled by a common server, so if both client and
>>> server are roughly IO bound you aren't going to get better performance
>>> by hitting the server with multiple clients simultaneously, only worse
>>> because the packets from different client requests are intermingled in
>>> the TCP traffic on that port. I'm not a networking expert but this is
>>> my best understanding of what is going on.
>>
>> Yibo Cai's experiment disproves that explanation, though.
>>
>> When I run a single client against the test server, I get ~4 GB/s.  When
>> I run 6 standalone clients against the *same* test server, I get ~8 GB/s
>> aggregate.  So there's something else going on that limits scalability
>> when the benchmark executable runs all clients by itself (perhaps gRPC
>> clients in a single process share some underlying structure or execution
>> threads? I don't know).
>
> I see, thanks. OK then clearly something else is going on.
>
>>> I hope someone will implement the "multiple test servers" TODO in the
>>> benchmark.
>>
>> I think that's a bad idea *in any case*, as running multiple servers on
>> different ports is not a realistic expectation from users.
>>
>> Regards
>>
>> Antoine.
