[ https://issues.apache.org/jira/browse/IMPALA-11814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695847#comment-17695847 ]
Michael Smith commented on IMPALA-11814:
----------------------------------------
The most likely candidates here are CPU starvation, network starvation, or
lock contention.
You can use something like pidstat (e.g.
https://access.redhat.com/solutions/69271) to look at CPU context switching,
and top to check for high CPU activity.
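A minimal sketch of what that could look like, assuming the executor daemon is
running as impalad (adjust the process name and sampling interval to your
deployment):
{code:bash}
# Voluntary/involuntary context switches per second for the impalad process,
# sampled every 5 seconds; a high nvcswch/s rate suggests threads are being
# preempted, i.e. CPU starvation.
pidstat -w -p $(pgrep -o impalad) 5

# Per-thread CPU usage for the same process; look for threads pinned near 100%.
top -H -p $(pgrep -o impalad)
{code}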
Something like iftop could be a starting point for looking at network
saturation.
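For example (a sketch; eth0 is a placeholder, substitute whichever interface
carries the cluster traffic):
{code:bash}
# Per-connection bandwidth on the given interface, with port numbers shown;
# check whether the link is saturated while the slow RPCs are being reported.
iftop -i eth0 -P
{code}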
If neither is maxed out, there are a number of tools for analyzing lock
contention:
- https://docs.kernel.org/locking/lockstat.html - usually requires recompiling
the kernel with CONFIG_LOCK_STAT enabled
- https://github.com/isc-projects/mutrace
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/systemtap_beginners_guide/futexcontentionsect
- perf sampling, looking for lock-related frames that show up repeatedly (see
the sketch below)
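For the perf option, a rough sketch (again assuming an impalad process; the
sampling rate and duration are arbitrary):
{code:bash}
# Sample on-CPU call stacks of the executor for 30 seconds.
perf record -F 99 -g -p $(pgrep -o impalad) -- sleep 30

# In the report, look for lock-related frames (pthread_mutex_lock,
# __lll_lock_wait, kernel futex paths) showing up repeatedly in hot stacks.
perf report --stdio
{code}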
> TransmitData RPCs become slow when multiple large queries are executed
> simultaneously
> ------------------------------------------------------------------------------------
>
> Key: IMPALA-11814
> URL: https://issues.apache.org/jira/browse/IMPALA-11814
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.0.0
> Reporter: YifanZhang
> Priority: Major
> Labels: rpc
>
> We have seen a lot of slow TransmitData RPCs in our clusters when multiple
> large queries are running. Logs from the executors:
> {code:java}
> I1227 16:44:24.859488 1134 rpcz_store.cc:269] Call
> impala.DataStreamService.TransmitData from 11.174.152.179:53824 (request call
> id 10458) took 368799ms. Request Metrics: {}
> I1227 16:44:24.859828 1134 rpcz_store.cc:273] Trace:
> 1227 16:38:16.060450 (+ 0us) impala-service-pool.cc:179] Inserting onto
> call queue
> 1227 16:38:16.060480 (+ 30us) impala-service-pool.cc:278] Handling call
> 1227 16:38:16.060498 (+ 18us) krpc-data-stream-recvr.cc:325] Enqueuing
> deferred RPC
> 1227 16:44:24.859372 (+368798874us) krpc-data-stream-recvr.cc:504] Processing
> deferred RPC
> 1227 16:44:24.859382 (+ 10us) krpc-data-stream-recvr.cc:397] Deserializing
> batch
> 1227 16:44:24.859442 (+ 60us) krpc-data-stream-recvr.cc:424] Enqueuing
> deserialized batch
> 1227 16:44:24.859451 (+ 9us) inbound_call.cc:162] Queueing success
> response
> Metrics: {} {code}
> The RPC spends a lot of time waiting in the queue. Some other executors also
> take a long time to transmit data:
> {code:java}
> I1227 16:44:23.379583 528 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.136.184.52:50257
> (fragment_instance_id=f9477dcd54c9d817:1bc346de000001e7): took 5m37s.
> Receiver time: 5m37s Network time: 1.343ms
> I1227 16:44:23.396365 631 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.174.158.152:58535
> (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s.
> Receiver time: 5m57s Network time: 2.970ms
> I1227 16:44:23.396482 4654 krpc-data-stream-sender.cc:430]
> f9477dcd54c9d817:1bc346de000001b6] Long delay waiting for RPC to
> 11.174.158.152:58535
> (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s
> I1227 16:44:23.444633 572 krpc-data-stream-sender.cc:396] Slow TransmitData
> RPC to 11.174.153.240:60303
> (fragment_instance_id=f9477dcd54c9d817:1bc346de000002c3): took 5m53s.
> Receiver time: 5m53s Network time: 17.604ms{code}
> The maximum handler latency is also very long:
> |*Method: _TransmitData_*|
> |Handler Latency|Count: 2321637, min / max: 3.000us / 5m10s, 25th %-ile: 22.000us, 50th %-ile: 38.000us, 75th %-ile: 55.000us, 90th %-ile: 72.000us, 95th %-ile: 86.000us, 99.9th %-ile: 4.512ms|
>