Will Berkeley created KUDU-2762:
-----------------------------------
Summary: Improve write transaction tracing
Key: KUDU-2762
URL: https://issues.apache.org/jira/browse/KUDU-2762
Project: Kudu
Issue Type: Improvement
Reporter: Will Berkeley
Here's a write transaction trace:
{noformat}
W0331 02:16:27.859648 27337 rpcz_store.cc:244] Call
kudu.tserver.TabletServerService.Write from 10.80.34.74:58250 (ReqId={client:
6a904be8604b482989e3d1592f8824f2, seq_no=66108, attempt_no=1}) took 7855ms
(client timeout 10000).
W0331 02:16:27.859728 27337 rpcz_store.cc:248] Trace:
0331 02:16:20.003835 (+ 0us) service_pool.cc:159] Inserting onto call queue
0331 02:16:20.003844 (+ 9us) service_pool.cc:218] Handling call
0331 02:16:27.859641 (+7855797us) inbound_call.cc:157] Queueing success response
Related trace 'txn':
0331 02:16:20.003988 (+ 0us) write_transaction.cc:101] PREPARE: Starting
0331 02:16:20.004087 (+ 99us) write_transaction.cc:268] Acquiring schema
lock in shared mode
0331 02:16:20.004088 (+ 1us) write_transaction.cc:271] Acquired schema lock
0331 02:16:20.004088 (+ 0us) tablet.cc:400] PREPARE: Decoding operations
0331 02:16:20.004130 (+ 42us) tablet.cc:422] PREPARE: Acquiring locks for 1
operations
0331 02:16:20.004137 (+ 7us) tablet.cc:426] PREPARE: locks acquired
0331 02:16:20.004138 (+ 1us) write_transaction.cc:126] PREPARE: finished.
0331 02:16:20.004154 (+ 16us) write_transaction.cc:136] Start()
0331 02:16:20.004157 (+ 3us) write_transaction.cc:141] Timestamp: P:
1554016580004153 usec, L: 0
0331 02:16:20.004192 (+ 35us) log.cc:582] Serialized 3741 byte log entry
0331 02:16:27.859496 (+7855304us) write_transaction.cc:149] APPLY: Starting
0331 02:16:27.859608 (+ 112us) tablet_metrics.cc:365] ProbeStats:
bloom_lookups=2,key_file_lookups=2,delta_file_lookups=4,mrs_lookups=0
0331 02:16:27.859614 (+ 6us) log.cc:582] Serialized 28 byte log entry
0331 02:16:27.859622 (+ 8us) write_transaction.cc:309] Releasing row and
schema locks
0331 02:16:27.859623 (+ 1us) write_transaction.cc:277] Released schema lock
0331 02:16:27.859625 (+ 2us) write_transaction.cc:196] FINISH: updating
metrics
Metrics:
{"tcmalloc_contention_cycles":7552,"child_traces":[["txn",{"apply.queue_time_us":7854429,"cfile_cache_hit":12,"cfile_cache_hit_bytes":72423,"delta_iterators_relevant":2,"num_ops":1,"prepare.queue_time_us":11,"prepare.run_cpu_time_us":226,"prepare.run_wall_time_us":225,"raft.queue_time_us":12,"raft.run_cpu_time_us":92,"raft.run_wall_time_us":91,"replication_time_us":898,"spinlock_wait_cycles":18688}]]}
{noformat}
It could use some polish. Here's a few things to fix:
1. Nit: the casing and punctuation of {{PREPARE: finished.}} and {{APPLY:
Starting}} is inconsistent.
2. {{PREPARE}} has a start and end; {{APPLY}} has a start but no end;
replication has neither. We should add the missing trace events.
3. nit: There's no need for the {{PREPARE:}} preamble to trace events within
{{PREPARE: Starting}} and {{PREPARE: finished.}}.
It's potentially misleading to read this trace. Almost all of the time is spent
between the trace event reporting the local WAL was written and the trace event
reporting apply started. Naively, this should mean the operation took a long
time to replicate, as that's the most obvious thing that could take a long
time. However, the metrics show the time was taken waiting in the apply queue,
held up by some other slow activity. This operation was fast to perform-- it
took a long time because it was held up, idling, by something else.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)