On 6/17/20 8:33 PM, David Li wrote:

Hey Yibo,

Thanks for investigating this! This is a great writeup.

There was a PR recently to let clients set gRPC options like this, so
it can be enabled on a case-by-case basis:
https://github.com/apache/arrow/pull/7406
So we could add that to the benchmark or suggest it in documentation.
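For illustration, a minimal, untested sketch of what that could look like (the
generic_options field name is taken from that PR and may differ from the
merged API):

#include <memory>

#include "arrow/flight/client.h"
#include "arrow/status.h"

// Untested sketch: pass a per-client gRPC channel argument through the
// generic option pass-through proposed in apache/arrow#7406.
arrow::Status ConnectWithPrivatePool(
    const arrow::flight::Location& location,
    std::unique_ptr<arrow::flight::FlightClient>* client) {
  arrow::flight::FlightClientOptions options;
  // String form of GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL: give this client its
  // own subchannel pool so it does not share a TCP connection with others.
  options.generic_options.emplace_back("grpc.use_local_subchannel_pool", 1);
  return arrow::flight::FlightClient::Connect(location, options, client);
}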

Thanks David, that's exactly what I want.


I think this benchmark is a bit of a pathological case for gRPC. gRPC
will share sockets when all client options are exactly the same; it
seems just adding TLS, for instance, would break that (unless you
intentionally shared TLS credentials, which Flight doesn't):
https://github.com/grpc/grpc/issues/15207. I believe grpc-java doesn't
have this behavior (different Channel instances won't share
connections).
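
For instance (a rough, untested illustration using plain gRPC rather than
Flight, with a placeholder address):

#include <grpcpp/grpcpp.h>

// Channels created with byte-identical arguments and credentials may be
// multiplexed over one subchannel, i.e. one TCP connection; varying any
// argument (or opting out of the global pool) forces a separate connection.
int main() {
  auto creds = grpc::InsecureChannelCredentials();
  grpc::ChannelArguments shared_args;
  auto a = grpc::CreateCustomChannel("localhost:31337", creds, shared_args);
  auto b = grpc::CreateCustomChannel("localhost:31337", creds, shared_args);
  // a and b can end up sharing a single socket.

  grpc::ChannelArguments own_pool;
  own_pool.SetInt("grpc.use_local_subchannel_pool", 1);  // opt out of sharing
  auto c = grpc::CreateCustomChannel("localhost:31337", creds, own_pool);
  // c gets its own connection because its arguments differ.
  return 0;
}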

Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I
wonder if that might also help performance a bit.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a
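If I'm reading the linked key right (GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED, i.e.
"grpc.experimental.tcp_tx_zerocopy_enabled"), enabling it would look roughly
like this; it also needs kernel MSG_ZEROCOPY support:

#include <grpcpp/grpcpp.h>

// Untested sketch: ask gRPC to use SO_ZEROCOPY / MSG_ZEROCOPY for sends.
grpc::ChannelArguments MakeZeroCopyArgs() {
  grpc::ChannelArguments args;
  args.SetInt("grpc.experimental.tcp_tx_zerocopy_enabled", 1);
  return args;
}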

I did a quick try. The Ubuntu 4.15 kernel doesn't support SO_ZEROCOPY, so I
upgraded to the 5.7 kernel. Per my test on the same host, there is no obvious
difference after setting this option. I will do more tests over the network.


Best,
David

On 6/17/20, Chengxin Ma <[email protected]> wrote:
Hi Yibo,


Your discovery is impressive.


Did you consider the `num_streams` parameter [1] as well? If I understood
correctly, this parameter sets the number of conceptual concurrent streams
between the client and the server, while `num_threads` sets the size of the
thread pool that actually handles these streams [2]. By default, both
parameters are 4; see the toy model below.
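
To illustrate the distinction as I understand it (illustrative names only, not
the benchmark's actual identifiers): `num_streams` decides how many work items
exist, while `num_threads` decides how many workers consume them.

#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy model: num_streams conceptual streams drained by a pool of
// num_threads worker threads.
int main() {
  const int num_streams = 4;
  const int num_threads = 4;
  std::queue<int> streams;
  for (int i = 0; i < num_streams; ++i) streams.push(i);
  std::mutex mu;
  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t) {
    pool.emplace_back([&] {
      for (;;) {
        int stream_id;
        {
          std::lock_guard<std::mutex> lock(mu);
          if (streams.empty()) return;  // no more streams to consume
          stream_id = streams.front();
          streams.pop();
        }
        // A real worker would call DoGet for this stream and read every batch.
        std::cout << "consuming stream " << stream_id << "\n";
      }
    });
  }
  for (auto& t : pool) t.join();
  return 0;
}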


As for CPU usage, the parameter `records_per_batch` [3] has an impact as
well. If you increase the value of this parameter, you will probably see
that the data transfer speed increases while the server-side CPU usage
drops [4].
My guess is that as more records are put in one record batch, the total
number of batches decreases (e.g. sending 2^20 records as 32 batches of 32768
instead of 256 batches of 4096 means 8x fewer per-batch metadata messages).
CPU is only used for (de)serializing the metadata (i.e. the schema) of each
record batch, while the payload itself can be transferred without
(de)serialization cost [5].


[1] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] 
https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] 
https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC"
in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/


Kind Regards
Chengxin



‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <[email protected]> wrote:

I found a way to achieve reasonable benchmark results with multiple threads.
The diff is pasted below for a quick review or try.
Tested on an E5-2650, with this change:
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828

When running `arrow_flight_benchmark`, I found there is only one TCP
connection between client and server, no matter what `num_threads` is: all
clients share one TCP connection. On the server side, I see only one thread
processing network packets. On my machine, one client already saturates a
CPU core, so things get worse as `num_threads` increases, because that single
server thread becomes the bottleneck.

When running in standalone mode, the flight clients are in different processes
and have their own TCP connections to the server. There are separate server
threads handling network traffic for each connection, without a central
bottleneck.

I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just before
giving up. Setting that arg makes each client establish its own TCP
connection to the server, similar to standalone mode.

Actually, I'm not quite sure whether we should set this arg. Sharing one TCP
connection is a reasonable configuration, and it's an advantage of
gRPC [2].

Per my test, most CPU cycles are spent in kernel mode doing networking and
data transfer. Maybe a better solution is to leverage modern networking
techniques like RDMA or a user-mode stack for higher performance.

[1]
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
[2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page 5

diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
index d530093d9..6904640d3 100644
--- a/cpp/src/arrow/flight/client.cc
+++ b/cpp/src/arrow/flight/client.cc
@@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
     args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
     // Receive messages of any size
     args.SetMaxReceiveMessageSize(-1);
+    // Setting this arg enables each client to open its own TCP connection to the
+    // server, not sharing one single connection, which becomes a bottleneck under high load.
+    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
     if (options.override_hostname != "") {
       args.SetSslTargetNameOverride(options.override_hostname);

On 6/15/20 10:00 PM, Wes McKinney wrote:


On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou [email protected] wrote:

On 15/06/2020 at 15:36, Wes McKinney wrote:

When you have only a single server, all the gRPC traffic goes through a
common port and is handled by a common server, so if both client and server
are roughly IO bound you aren't going to get better performance by hitting
the server with multiple clients simultaneously, only worse, because the
packets from different client requests are intermingled in the TCP traffic
on that port. I'm not a networking expert but this is my best understanding
of what is going on.

Yibo Cai's experiment disproves that explanation, though.
When I run a single client against the test server, I get ~4 GB/s. When I
run 6 standalone clients against the same test server, I get ~8 GB/s
aggregate. So there's something else going on that limits scalability when
the benchmark executable runs all clients by itself (perhaps gRPC clients in
a single process share some underlying structure or execution threads? I
don't know).

I see, thanks. OK then clearly something else is going on.

I hope someone will implement the "multiple test servers" TODO in the
benchmark.

I think that's a bad idea in any case, as running multiple servers on
different ports is not a realistic expectation from users.

Regards
Antoine.



