Will Berkeley created KUDU-2762:
-----------------------------------

             Summary: Improve write transaction tracing
                 Key: KUDU-2762
                 URL: https://issues.apache.org/jira/browse/KUDU-2762
             Project: Kudu
          Issue Type: Improvement
            Reporter: Will Berkeley


Here's a write transaction trace:

{noformat}
W0331 02:16:27.859648 27337 rpcz_store.cc:244] Call 
kudu.tserver.TabletServerService.Write from 10.80.34.74:58250 (ReqId={client: 
6a904be8604b482989e3d1592f8824f2, seq_no=66108, attempt_no=1}) took 7855ms 
(client timeout 10000).
W0331 02:16:27.859728 27337 rpcz_store.cc:248] Trace:
0331 02:16:20.003835 (+     0us) service_pool.cc:159] Inserting onto call queue
0331 02:16:20.003844 (+     9us) service_pool.cc:218] Handling call
0331 02:16:27.859641 (+7855797us) inbound_call.cc:157] Queueing success response
Related trace 'txn':
0331 02:16:20.003988 (+     0us) write_transaction.cc:101] PREPARE: Starting
0331 02:16:20.004087 (+    99us) write_transaction.cc:268] Acquiring schema 
lock in shared mode
0331 02:16:20.004088 (+     1us) write_transaction.cc:271] Acquired schema lock
0331 02:16:20.004088 (+     0us) tablet.cc:400] PREPARE: Decoding operations
0331 02:16:20.004130 (+    42us) tablet.cc:422] PREPARE: Acquiring locks for 1 
operations
0331 02:16:20.004137 (+     7us) tablet.cc:426] PREPARE: locks acquired
0331 02:16:20.004138 (+     1us) write_transaction.cc:126] PREPARE: finished.
0331 02:16:20.004154 (+    16us) write_transaction.cc:136] Start()
0331 02:16:20.004157 (+     3us) write_transaction.cc:141] Timestamp: P: 
1554016580004153 usec, L: 0
0331 02:16:20.004192 (+    35us) log.cc:582] Serialized 3741 byte log entry
0331 02:16:27.859496 (+7855304us) write_transaction.cc:149] APPLY: Starting
0331 02:16:27.859608 (+   112us) tablet_metrics.cc:365] ProbeStats: 
bloom_lookups=2,key_file_lookups=2,delta_file_lookups=4,mrs_lookups=0
0331 02:16:27.859614 (+     6us) log.cc:582] Serialized 28 byte log entry
0331 02:16:27.859622 (+     8us) write_transaction.cc:309] Releasing row and 
schema locks
0331 02:16:27.859623 (+     1us) write_transaction.cc:277] Released schema lock
0331 02:16:27.859625 (+     2us) write_transaction.cc:196] FINISH: updating 
metrics
Metrics: 
{"tcmalloc_contention_cycles":7552,"child_traces":[["txn",{"apply.queue_time_us":7854429,"cfile_cache_hit":12,"cfile_cache_hit_bytes":72423,"delta_iterators_relevant":2,"num_ops":1,"prepare.queue_time_us":11,"prepare.run_cpu_time_us":226,"prepare.run_wall_time_us":225,"raft.queue_time_us":12,"raft.run_cpu_time_us":92,"raft.run_wall_time_us":91,"replication_time_us":898,"spinlock_wait_cycles":18688}]]}
{noformat}

It could use some polish. Here's a few things to fix:
1. Nit: the casing and punctuation of {{PREPARE: finished.}} and {{APPLY: 
Starting}} is inconsistent.
2. {{PREPARE}} has a start and end; {{APPLY}} has a start but no end; 
replication has neither. We should add the missing trace events.
3. nit: There's no need for the {{PREPARE:}} preamble to trace events within 
{{PREPARE: Starting}} and {{PREPARE: finished.}}.

It's potentially misleading to read this trace. Almost all of the time is spent 
between the trace event reporting the local WAL was written and the trace event 
reporting apply started. Naively, this should mean the operation took a long 
time to replicate, as that's the most obvious thing that could take a long 
time. However, the metrics show the time was taken waiting in the apply queue, 
held up by some other slow activity. This operation was fast to perform-- it 
took a long time because it was held up, idling, by something else.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to