[ 
https://issues.apache.org/jira/browse/IMPALA-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974582#comment-16974582
 ] 

ASF subversion and git services commented on IMPALA-9128:
---------------------------------------------------------

Commit cf224f8461553e41a095404235eb8f58f795eedd in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=cf224f8 ]

IMPALA-9128: part 2: dump traces for slow RPCs

This adds trace events for data stream RPCs and
dumps them when they take longer than
--impala_slow_rpc_threshold_ms.

I needed to modify the KRPC code to do this because it
currently only dumps traces for RPCs with deadlines.
I plan to add some version of this upstream in Kudu
so that we don't diverge our KRPC implementation.

Example output from test_exchange_small_buffer:

I1111 08:38:53.732910 26509 rpcz_store.cc:265] Call 
impala.DataStreamService.TransmitData from 127.0.0.1:42434 (request call id 43) 
took 7799ms. Request Metrics: {}
I1111 08:38:53.732928 26509 rpcz_store.cc:269] Trace:
1111 08:38:45.933412 (+     0us) impala-service-pool.cc:167] Inserting onto 
call queue
1111 08:38:45.933449 (+    37us) impala-service-pool.cc:254] Handling call
1111 08:38:45.933470 (+    21us) krpc-data-stream-mgr.cc:227] Added early sender
1111 08:38:47.906542 (+1973072us) krpc-data-stream-recvr.cc:327] Enqueuing 
deferred RPC
1111 08:38:53.732858 (+5826316us) krpc-data-stream-recvr.cc:506] Processing 
deferred RPC
1111 08:38:53.732860 (+     2us) krpc-data-stream-recvr.cc:399] Deserializing 
batch
1111 08:38:53.732888 (+    28us) krpc-data-stream-recvr.cc:426] Enqueuing 
deserialized batch
1111 08:38:53.732895 (+     7us) inbound_call.cc:162] Queueing success response

Disabled +-clang-diagnostic-gnu-zero-variadic-macro-arguments because it
had false positives on the TRACE_TO invocations.

Testing:
* Ran exhaustive and ASAN tests
* Ran stress test

Change-Id: Ic7af4b45c43ec731d742d3696112c5f800849947
Reviewed-on: http://gerrit.cloudera.org:8080/14668
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Improve debugging for slow sends in KrpcDataStreamSender
> --------------------------------------------------------
>
>                 Key: IMPALA-9128
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9128
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: observability
>
> I'm trying to debug a problem that appears to be caused by a slow RPC:
> {noformat}
>     Fragment F00
>       Instance 754fc21ba4744310:d58fd04200000020 (host=xxxxxxxxx)
>         Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 
> MB 
>         - AverageThreadTokens: 1.00 (1.0)
>         - BloomFilterBytes: 0 B (0)
>         - InactiveTotalTime: 0ns (0)
>         - PeakMemoryUsage: 3.2 MiB (3337546)
>         - PeakReservation: 2.0 MiB (2097152)
>         - PeakUsedReservation: 0 B (0)
>         - PerHostPeakMemUsage: 6.7 MiB (6987376)
>         - RowsProduced: 7 (7)
>         - TotalNetworkReceiveTime: 0ns (0)
>         - TotalNetworkSendTime: 3.6m (215354065071)
>         - TotalStorageWaitTime: 4ms (4552708)
>         - TotalThreadsInvoluntaryContextSwitches: 2 (2)
>         - TotalThreadsTotalWallClockTime: 3.6m (215924079474)
>           - TotalThreadsSysTime: 24ms (24386000)
>           - TotalThreadsUserTime: 505ms (505714000)
>         - TotalThreadsVoluntaryContextSwitches: 3,623 (3623)
>         - TotalTime: 3.6m (215801961705)
>         Fragment Instance Lifecycle Event Timeline
>           Prepare Finished: 1ms (1812344)
>           Open Finished: 322ms (322905753)
>           First Batch Produced: 447ms (447050377)
>           First Batch Sent: 447ms (447054546)
>           ExecInternal Finished: 3.6m (215802284852)
>         Buffer pool
>           - AllocTime: 0ns (0)
>           - CumulativeAllocationBytes: 0 B (0)
>           - CumulativeAllocations: 0 (0)
>           - InactiveTotalTime: 0ns (0)
>           - PeakReservation: 0 B (0)
>           - PeakUnpinnedBytes: 0 B (0)
>           - PeakUsedReservation: 0 B (0)
>           - ReadIoBytes: 0 B (0)
>           - ReadIoOps: 0 (0)
>           - ReadIoWaitTime: 0ns (0)
>           - ReservationLimit: 0 B (0)
>           - TotalTime: 0ns (0)
>           - WriteIoBytes: 0 B (0)
>           - WriteIoOps: 0 (0)
>           - WriteIoWaitTime: 0ns (0)
>         Fragment Instance Lifecycle Timings
>           - ExecTime: 3.6m (215479380267)
>             - ExecTreeExecTime: 124ms (124299400)
>           - InactiveTotalTime: 0ns (0)
>           - OpenTime: 321ms (321088906)
>             - ExecTreeOpenTime: 572.04us (572045)
>           - PrepareTime: 1ms (1426412)
>             - ExecTreePrepareTime: 233.32us (233318)
>           - TotalTime: 0ns (0)
>         KrpcDataStreamSender (dst_id=3)
>           - EosSent: 58 (58)
>           - InactiveTotalTime: 3.6m (215354085858)
>           - PeakMemoryUsage: 464.4 KiB (475504)
>           - RowsSent: 7 (7)
>           - RpcFailure: 0 (0)
>           - RpcRetry: 0 (0)
>           - SerializeBatchTime: 99.87us (99867)
>           - TotalBytesSent: 207 B (207)
>           - TotalTime: 3.6m (215355336381)
>           - UncompressedRowBatchSize: 267 B (267)
> {noformat}
> We should add some diagnostics that will allow us to figure out which RPCs 
> are slow and whether there's a pattern about which host is the problem. E.g. 
> maybe we should log if the RPC time exceeds a configured threshold.
> It may also be useful to include some stats about the wait time, e.g. a 
> histogram of the wait times, so that we can see if it's an outlier or general 
> slowness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to