[
https://issues.apache.org/jira/browse/KUDU-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mitch Barnett updated KUDU-2762:
--------------------------------
Status: In Review (was: In Progress)
> Improve write transaction tracing
> ---------------------------------
>
> Key: KUDU-2762
> URL: https://issues.apache.org/jira/browse/KUDU-2762
> Project: Kudu
> Issue Type: Improvement
> Reporter: Will Berkeley
> Assignee: Mitch Barnett
> Priority: Major
>
> Here's a write transaction trace:
> {noformat}
> W0331 02:16:27.859648 27337 rpcz_store.cc:244] Call
> kudu.tserver.TabletServerService.Write from 10.80.34.74:58250 (ReqId={client:
> 6a904be8604b482989e3d1592f8824f2, seq_no=66108, attempt_no=1}) took 7855ms
> (client timeout 10000).
> W0331 02:16:27.859728 27337 rpcz_store.cc:248] Trace:
> 0331 02:16:20.003835 (+ 0us) service_pool.cc:159] Inserting onto call
> queue
> 0331 02:16:20.003844 (+ 9us) service_pool.cc:218] Handling call
> 0331 02:16:27.859641 (+7855797us) inbound_call.cc:157] Queueing success
> response
> Related trace 'txn':
> 0331 02:16:20.003988 (+ 0us) write_transaction.cc:101] PREPARE: Starting
> 0331 02:16:20.004087 (+ 99us) write_transaction.cc:268] Acquiring schema
> lock in shared mode
> 0331 02:16:20.004088 (+ 1us) write_transaction.cc:271] Acquired schema
> lock
> 0331 02:16:20.004088 (+ 0us) tablet.cc:400] PREPARE: Decoding operations
> 0331 02:16:20.004130 (+ 42us) tablet.cc:422] PREPARE: Acquiring locks for
> 1 operations
> 0331 02:16:20.004137 (+ 7us) tablet.cc:426] PREPARE: locks acquired
> 0331 02:16:20.004138 (+ 1us) write_transaction.cc:126] PREPARE: finished.
> 0331 02:16:20.004154 (+ 16us) write_transaction.cc:136] Start()
> 0331 02:16:20.004157 (+ 3us) write_transaction.cc:141] Timestamp: P:
> 1554016580004153 usec, L: 0
> 0331 02:16:20.004192 (+ 35us) log.cc:582] Serialized 3741 byte log entry
> 0331 02:16:27.859496 (+7855304us) write_transaction.cc:149] APPLY: Starting
> 0331 02:16:27.859608 (+ 112us) tablet_metrics.cc:365] ProbeStats:
> bloom_lookups=2,key_file_lookups=2,delta_file_lookups=4,mrs_lookups=0
> 0331 02:16:27.859614 (+ 6us) log.cc:582] Serialized 28 byte log entry
> 0331 02:16:27.859622 (+ 8us) write_transaction.cc:309] Releasing row and
> schema locks
> 0331 02:16:27.859623 (+ 1us) write_transaction.cc:277] Released schema
> lock
> 0331 02:16:27.859625 (+ 2us) write_transaction.cc:196] FINISH: updating
> metrics
> Metrics:
> {"tcmalloc_contention_cycles":7552,"child_traces":[["txn",{"apply.queue_time_us":7854429,"cfile_cache_hit":12,"cfile_cache_hit_bytes":72423,"delta_iterators_relevant":2,"num_ops":1,"prepare.queue_time_us":11,"prepare.run_cpu_time_us":226,"prepare.run_wall_time_us":225,"raft.queue_time_us":12,"raft.run_cpu_time_us":92,"raft.run_wall_time_us":91,"replication_time_us":898,"spinlock_wait_cycles":18688}]]}
> {noformat}
> It could use some polish. Here's a few things to fix:
> 1. Nit: the casing and punctuation of {{PREPARE: finished.}} and {{APPLY:
> Starting}} is inconsistent.
> 2. {{PREPARE}} has a start and end; {{APPLY}} has a start but no end;
> replication has neither. We should add the missing trace events.
> 3. nit: There's no need for the {{PREPARE:}} preamble to trace events within
> {{PREPARE: Starting}} and {{PREPARE: finished.}}.
> It's potentially misleading to read this trace. Almost all of the time is
> spent between the trace event reporting the local WAL was written and the
> trace event reporting apply started. Naively, this should mean the operation
> took a long time to replicate, as that's the most obvious thing that could
> take a long time. However, the metrics show the time was taken waiting in the
> apply queue, held up by some other slow activity. This operation was fast to
> perform-- it took a long time because it was held up, idling, by something
> else.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)