YifanZhang created IMPALA-11814:
-----------------------------------
Summary: TransmitData RPC become slow when multiple large queries
are executed simultaneously
Key: IMPALA-11814
URL: https://issues.apache.org/jira/browse/IMPALA-11814
Project: IMPALA
Issue Type: Bug
Reporter: YifanZhang
We have seen a lot of slow TransmitData RPC in our clusters when multiple large
queries are running. Logs in executors:
{code:java}
1227 16:44:24.859488 1134 rpcz_store.cc:269] Call
impala.DataStreamService.TransmitData from 11.174.152.179:53824 (request call
id 10458) took 368799ms. Request Metrics: {}
I1227 16:44:24.859828 1134 rpcz_store.cc:273] Trace:
1227 16:38:16.060450 (+ 0us) impala-service-pool.cc:179] Inserting onto
call queue
1227 16:38:16.060480 (+ 30us) impala-service-pool.cc:278] Handling call
1227 16:38:16.060498 (+ 18us) krpc-data-stream-recvr.cc:325] Enqueuing
deferred RPC
1227 16:44:24.859372 (+368798874us) krpc-data-stream-recvr.cc:504] Processing
deferred RPC
1227 16:44:24.859382 (+ 10us) krpc-data-stream-recvr.cc:397] Deserializing
batch
1227 16:44:24.859442 (+ 60us) krpc-data-stream-recvr.cc:424] Enqueuing
deserialized batch
1227 16:44:24.859451 (+ 9us) inbound_call.cc:162] Queueing success response
Metrics: {} {code}
The RPC spends a lot of time waiting in the queue. Some other executors also
take a long time to transmit data:
{code:java}
I1227 16:44:23.379583 528 krpc-data-stream-sender.cc:396] Slow TransmitData
RPC to 11.136.184.52:50257
(fragment_instance_id=f9477dcd54c9d817:1bc346de000001e7): took 5m37s. Receiver
time: 5m37s Network time: 1.343ms
I1227 16:44:23.396365 631 krpc-data-stream-sender.cc:396] Slow TransmitData
RPC to 11.174.158.152:58535
(fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s. Receiver
time: 5m57s Network time: 2.970ms
I1227 16:44:23.396482 4654 krpc-data-stream-sender.cc:430]
f9477dcd54c9d817:1bc346de000001b6] Long delay waiting for RPC to
11.174.158.152:58535 (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231):
took 5m57s
I1227 16:44:23.444633 572 krpc-data-stream-sender.cc:396] Slow TransmitData
RPC to 11.174.153.240:60303
(fragment_instance_id=f9477dcd54c9d817:1bc346de000002c3): took 5m53s. Receiver
time: 5m53s Network time: 17.604ms{code}
The max handler latency time is also very long:
|*Method: _TransmitData_*|
|Handler Latency|Count: 2321637, min / max: 3.000us / 5m10s, 25th %-ile:
22.000us, 50th %-ile: 38.000us, 75th %-ile: 55.000us, 90th %-ile: 72.000us,
95th %-ile: 86.000us, 99.9th %-ile: 4.512ms|
--
This message was sent by Atlassian Jira
(v8.20.10#820010)