[trying a 2nd time without attachments]

This is a follow-on to the discussion in the 12/3 performance meeting. I wanted 
to share what we've done so far with IO path analysis and LTTng, and to see if 
we can establish a general methodology we could use to go after different areas.

We've been characterizing the memstore IO path for a while, using VTune and other 
tools.  The idea with this tracing was to get a high-level a->b->c set of 
points and timings, and then drill down into the areas that are the larger 
contributors to latency.  Looking from the point of view of the client (i.e. 
rados bench), there are still some gaps between its latency and what is totaled 
up by the trace processing script.  Basically:


1)  Instrument the code, install, and run the benchmark with LTTng tracing 
enabled (a sketch of one of the tracepoints follows the output below).
............ qd=1, single osd, memstore, logging disabled, 4K writes .......

        Total time run:         20.001899
        Total writes made:      16136
        Write size:             4096
        Bandwidth (MB/sec):     3.151 

        Stddev Bandwidth:       0.721435
        Max bandwidth (MB/sec): 3.39453
        Min bandwidth (MB/sec): 0
        Average Latency:        0.00123722
        Stddev Latency:         0.000103216
        Max latency:            0.002637
        Min latency:            0.000401
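
As a reference for step 1, the instrumentation is just LTTng-UST tracepoints 
added along the IO path. A minimal sketch of what one of the probes could look 
like (the provider/field names here are illustrative, not necessarily the exact 
ones in our patch):

    /* tracepoint definition, e.g. in a tracing provider (.tp) file */
    TRACEPOINT_EVENT(
        ioctximpl,                    /* provider */
        aio_write,                    /* event name */
        TP_ARGS(const char *, oid),   /* argument passed at the call site */
        TP_FIELDS(
            ctf_string(oid, oid)      /* record the object name so ops can be correlated */
        )
    )

    /* call site in the client IO path (illustrative) */
    tracepoint(ioctximpl, aio_write, oid.name.c_str());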

2)  Grab the LTTng trace (excerpt below; the capture commands are sketched after it):

[09:34:23.661871204] (+0.000003133) cephtestnode1 ioctximpl:aio_write: { cpu_id = 12 }, { pthread_id = 140262776686656 }, { oid = "benchmark_data_cephtestnode1_37038_object16" }
[09:34:23.661873274] (+0.000002070) cephtestnode1 ioctximpl:queue_aio_write: { cpu_id = 12 }, { pthread_id = 140262776686656 }, { num = 0 }
.....
[09:34:23.662166430] (+0.000206888) cephtestnode1 osd:ms_fast_dispatch: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { oid = "benchmark_data_cephtestnode1_37038_object16", type = 8, num = 4107, tid = 17, inc = 0 }
[09:34:23.662176713] (+0.000010283) cephtestnode1 oprequest:set_rmw_flags: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { type = 8, num = 4107, tid = 17, inc = 0, flag = 0x4, old_rmw_flags = 0x0, new_rmw_flags = 0x4 }
[09:34:23.662193699] (+0.000016986) cephtestnode1 osd:enqueue_op: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { oid = "benchmark_data_cephtestnode1_37038_object16" }
......

3) Process it with a perl script, using the OID (traced at some of the points) to 
track a single op.  When an op completes (hits the last tracepoint we have 
instrumented), we grab the timestamps of all the preceding tracepoints and emit 
the following, which shows the thread ID and the delta in time (microseconds) 
since the previous tracepoint.  A rough sketch of this correlation logic follows 
the example below.  Example:
        Op Transition list for benchmark_data_cephtestnode1_37038_object53, 18 states
          - [ ThrID   us ]
          - [ 86656 +0000] Start-> ioctximpl:aio_write:
          - [ 86656 +0010]     -> ioctximpl:queue_aio_write:
          - [ 24224 +0300]     -> osd:ms_fast_dispatch:
          - [ 24224 +0011]     -> oprequest:set_rmw_flags:
          - [ 24224 +0017]     -> osd:enqueue_op:
          - [ 24224 +0006]     -> osd:shard_index_enqueue:
          - [ 24224 +0013]     -> pg:queue_op:
          - [ 94688 +0068]     -> osd:opwq_process_start:
          - [ 94688 +0005]     -> osd:dequeue_op:
          - [ 94688 +0003]     -> mark_flagpt:reached_pg
          - [ 94688 +0049]     -> mark_flagpt:started
          - [ 94688 +0005]     -> mark_flagpt:started_2
          - [ 94688 +0005]     -> osd:prepare_tx_enter:
          - [ 94688 +0003]     -> osd:do_osd_op_pre:
          - [ 94688 +0002]     -> osd:do_osd_op_pre_write:
          - [ 94688 +0015]     -> osd:do_osd_op_post:
          - [ 94688 +0026]     -> osd:prepare_tx_exit:
          - [ 94688 +0188]     -> osd:opwq_process_finish:
          === > Total: 726 us
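
For what it's worth, the correlation the perl script does is nothing fancy. A 
rough sketch of the same idea in C++, reading babeltrace text output on stdin 
(the regex and the name of the final event are assumptions about the output 
format above, and events that don't carry the oid, e.g. set_rmw_flags, are 
presumably matched by other keys in the real script but are simply skipped here):

    // Group trace events by oid; when the final instrumented tracepoint is
    // seen for an op, print each step's delta from the previous one.
    #include <iostream>
    #include <map>
    #include <regex>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // babeltrace text lines look roughly like:
        // [09:34:23.661871204] (+0.000003133) host ioctximpl:aio_write: { ... }, { oid = "..." }
        std::regex line_re(
            "\\[(\\d+):(\\d+):(\\d+\\.\\d+)\\]\\s+\\S+\\s+\\S+\\s+(\\S+):\\s.*"
            "oid = \"([^\"]+)\"");
        const std::string final_event = "osd:opwq_process_finish";

        std::map<std::string, std::vector<std::pair<double, std::string>>> ops;
        std::string line;
        while (std::getline(std::cin, line)) {
            std::smatch m;
            if (!std::regex_search(line, m, line_re))
                continue;  // no oid field on this event; skipped in this sketch
            double ts = std::stod(m[1].str()) * 3600.0 +
                        std::stod(m[2].str()) * 60.0 +
                        std::stod(m[3].str());            // seconds since midnight
            std::string event = m[4].str(), oid = m[5].str();
            auto &steps = ops[oid];
            steps.emplace_back(ts, event);
            if (event != final_event)
                continue;
            std::cout << "Op transition list for " << oid << ", "
                      << steps.size() << " states\n";
            for (size_t i = 0; i < steps.size(); ++i) {
                long delta_us = i ? long((steps[i].first - steps[i - 1].first) * 1e6) : 0;
                std::cout << "  [+" << delta_us << " us] -> " << steps[i].second << "\n";
            }
            long total_us = long((steps.back().first - steps.front().first) * 1e6);
            std::cout << "  ===> Total: " << total_us << " us\n";
            ops.erase(oid);
        }
        return 0;
    }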


In order to see a trace of the entire run (vs. just looking at one op or an 
average across thousands), I also generated a CSV with all of the ops and put 
them in a graph.

The total time tracked by the script is about 750us, while rados bench reports an 
average latency of about 1.2ms, so there are still some missing parts; I believe 
a fair amount of that is the time until the commit is seen by the client.  The 
IOPS measured by the script (counting the 'process_finish' traces) match closely 
with what rados bench is reporting.

We have a lot of things turned off by using a single replica and memstore, but it 
seems the latency tracked here will still be a contributor in fuller 
configurations.  Any thoughts on how to better track a single op back to the 
client thread would be welcome, as would other approaches.  Thanks!

I've attached some files with more detail, but I've had trouble with attachments 
in the past, so I'm not sure they'll go through.

Thanks,

Stephen