[ https://issues.apache.org/jira/browse/ARROW-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332470#comment-17332470 ]

David Li commented on ARROW-10351:
----------------------------------

Sure, serializing the batches is otherwise too cheap to justify the thread. 
However, if compression plus a background thread can't outperform no compression 
at all, then there's little point to compression in the first place.

I tested between two EC2 t3.xlarge instances (4 vCPUs, 16 GB RAM each). They 
should have ~5 Gbps of bandwidth between them. All benchmarks were run as 
{{./release/arrow-flight-benchmark -test_put -num_perf_runs=4 -num_streams=4 
-num_threads=1 -server_host=(host)}}.

With compression, with background thread:
{noformat}
Using standalone TCP server
Server host: ip-172-31-68-128.ec2.internal
Server port: 31337
Testing method: DoPut
Number of perf runs: 4
Number of concurrent gets/puts: 1
Batch size: 131040
Batches written: 39072
Bytes written: 5120000000
Nanos: 9203507933
Speed: 530.538 MB/s
Throughput: 4245.34 batches/s
Latency mean: 230 us
Latency quantile=0.5: 182 us
Latency quantile=0.95: 392 us
Latency quantile=0.99: 1411 us
Latency max: 11809 us
{noformat}
With compression, without background thread:
{noformat}
Using standalone TCP server
Server host: ip-172-31-68-128.ec2.internal
Server port: 31337
Testing method: DoPut
Number of perf runs: 4
Number of concurrent gets/puts: 1
Batch size: 131040
Batches written: 39072
Bytes written: 5120000000
Nanos: 9256189526
Speed: 527.519 MB/s
Throughput: 4221.18 batches/s
Latency mean: 232 us
Latency quantile=0.5: 195 us
Latency quantile=0.95: 328 us
Latency quantile=0.99: 874 us
Latency max: 20200 us
{noformat}
Without compression, without background thread:
{noformat}
Using standalone TCP server
Server host: ip-172-31-68-128.ec2.internal
Server port: 31337
Testing method: DoPut
Number of perf runs: 4
Number of concurrent gets/puts: 1
Batch size: 131040
Batches written: 39072
Bytes written: 5120000000
Nanos: 8678223134
Speed: 562.651 MB/s
Throughput: 4502.3 batches/s
Latency mean: 216 us
Latency quantile=0.5: 55 us
Latency quantile=0.95: 1556 us
Latency quantile=0.99: 2806 us
Latency max: 21395 us
{noformat}
In short, for Flight, it seems compression is simply not worth it here, 
regardless of whether there's a background thread. The tradeoff may change when 
less bandwidth is available. Tail latency (p95/p99) does look better with 
compression, though.

And there are other factors. For instance, the benchmark uses random data, which 
may not compress well; a different dataset might fare better. ZSTD is relatively 
fast, but here we aren't tuning it for compression/decompression speed.

> [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously 
> helps performance
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10351
>                 URL: https://issues.apache.org/jira/browse/ARROW-10351
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, FlightRPC
>            Reporter: Wes McKinney
>            Priority: Major
>
> We don't use any asynchronous concepts in the way that Flight is implemented 
> now, i.e. IPC deconstruction/reconstruction (which may include compression!) 
> is not performed concurrent with moving FlightData objects through the gRPC 
> machinery, which may yield suboptimal performance. 
> It might be better to apply an actor-type approach where a dedicated thread 
> retrieves and prepares the next raw IPC message (within a Future) while the 
> current IPC message is being processed -- that way reading/writing to/from 
> the gRPC stream is not blocked on the IPC code doing its thing. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
