[ https://issues.apache.org/jira/browse/IMPALA-11814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695847#comment-17695847 ]
Michael Smith commented on IMPALA-11814:
----------------------------------------
The most likely candidates here are CPU starvation, network starvation, or
lock contention.
You can use something like pidstat (e.g.
https://access.redhat.com/solutions/69271) to look at CPU context switching,
and top to check for high CPU activity.
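A minimal sketch of what that could look like, assuming the executor daemon is
running as impalad (adjust the process name and sampling interval to your
deployment):
{code:bash}
# Voluntary/involuntary context switches per second for the impalad process,
# sampled every 5 seconds; a high nvcswch/s rate suggests threads are being
# preempted, i.e. CPU starvation.
pidstat -w -p $(pgrep -o impalad) 5

# Per-thread CPU usage for the same process; look for threads pinned near 100%.
top -H -p $(pgrep -o impalad)
{code}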
Something like iftop could be a starting point for looking at network
saturation.
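For example (a sketch; eth0 is a placeholder, substitute whichever interface
carries the cluster traffic):
{code:bash}
# Per-connection bandwidth on the given interface, with port numbers shown;
# check whether the link is saturated while the slow RPCs are being reported.
iftop -i eth0 -P
{code}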
If neither is maxed out, there are a number of tools for analyzing lock
contention:
- https://docs.kernel.org/locking/lockstat.html - usually requires recompiling
the kernel with CONFIG_LOCK_STAT enabled
- https://github.com/isc-projects/mutrace
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/systemtap_beginners_guide/futexcontentionsect
- perf sampling, looking for lock-related frames that show up repeatedly (see
the sketch below)
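For the perf option, a rough sketch (again assuming an impalad process; the
sampling rate and duration are arbitrary):
{code:bash}
# Sample on-CPU call stacks of the executor for 30 seconds.
perf record -F 99 -g -p $(pgrep -o impalad) -- sleep 30

# In the report, look for lock-related frames (pthread_mutex_lock,
# __lll_lock_wait, kernel futex paths) showing up repeatedly in hot stacks.
perf report --stdio
{code}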
> TransmitData RPCs become slow when multiple large queries are executed
> simultaneously
> ------------------------------------------------------------------------------------
>
> Key: IMPALA-11814
> URL: https://issues.apache.org/jira/browse/IMPALA-11814
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0.0
> Reporter: YifanZhang
> Priority: Major
> Labels: rpc
>
> We have seen a lot of slow TransmitData RPCs in our clusters when multiple
> large queries are running. Logs from the executors:
> {code:java}
> I1227 16:44:24.859488 1134 rpcz_store.cc:269] Call
> impala.DataStreamService.TransmitData from 11.174.152.179:53824 (request call
> id 10458) took 368799ms. Request Metrics: {}
> I1227 16:44:24.859828 1134 rpcz_store.cc:273] Trace:
> 1227 16:38:16.060450 (+ 0us) impala-service-pool.cc:179] Inserting onto
> call queue
> 1227 16:38:16.060480 (+ 30us) impala-service-pool.cc:278] Handling call
> 1227 16:38:16.060498 (+ 18us) krpc-data-stream-recvr.cc:325] Enqueuing
> deferred RPC
> 1227 16:44:24.859372 (+368798874us) krpc-data-stream-recvr.cc:504] Processing
> deferred RPC
> 1227 16:44:24.859382 (+ 10us) krpc-data-stream-recvr.cc:397] Deserializing
> batch
> 1227 16:44:24.859442 (+ 60us) krpc-data-stream-recvr.cc:424] Enqueuing
> deserialized batch
> 1227 16:44:24.859451 (+ 9us) inbound_call.cc:162] Queueing success
> response
> Metrics: {} {code}
> The RPC spends a lot of time waiting in the queue. Some other executors also
> take a long time to transmit data:
> {code:java}
> I1227 16:44:23.379583 528 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.136.184.52:50257
> (fragment_instance_id=f9477dcd54c9d817:1bc346de000001e7): took 5m37s.
> Receiver time: 5m37s Network time: 1.343ms
> I1227 16:44:23.396365 631 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.174.158.152:58535
> (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s.
> Receiver time: 5m57s Network time: 2.970ms
> I1227 16:44:23.396482 4654 krpc-data-stream-sender.cc:430]
> f9477dcd54c9d817:1bc346de000001b6] Long delay waiting for RPC to
> 11.174.158.152:58535
> (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s
> I1227 16:44:23.444633 572 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.174.153.240:60303
> (fragment_instance_id=f9477dcd54c9d817:1bc346de000002c3): took 5m53s.
> Receiver time: 5m53s Network time: 17.604ms{code}
> The maximum handler latency is also very long:
> |*Method: _TransmitData_*|
> |Handler Latency|Count: 2321637, min / max: 3.000us / 5m10s, 25th %-ile: 22.000us, 50th %-ile: 38.000us, 75th %-ile: 55.000us, 90th %-ile: 72.000us, 95th %-ile: 86.000us, 99.9th %-ile: 4.512ms|
>