Hi, I have another issue with timestamps returned by bpf_ktime_get_ns(), but I am not sure I have identified the root cause: https://github.com/weaveworks/scope/issues/2650
I have a few questions that might help to understand:

- Does the issue happen when you use a per-CPU map? https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array

- Do you test on bare metal or on some kind of VM, such as EC2, GCE or VirtualBox? My problem does not happen on bare metal, only in virtualized environments. I noticed that one BPF function can take more than 100ms in virtualized environments, presumably because the hypervisor schedules the CPU to another VM. I took two timestamps with bpf_ktime_get_ns() in the same BPF function and checked the difference. On bare metal, I have not noticed this kind of variable execution time.

- The time between your bpf_ktime_get_ns() call and your tsp.update() call is not zero. Another CPU might execute the BPF function concurrently and finish faster than the first CPU, so the later timestamp can be stored first. I have not observed this kind of issue in practice on bare metal, but I have seen it on GCE.

- I noticed this worrying comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]". See https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388. I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There was a kernel bug that caused timestamps to be wrong; see https://github.com/weaveworks/scope/issues/2334#issue-214111575:

  > This bug was introduced in 4.8 but was backported to older kernels by
  > distros. The fix was included in 4.9 and also backported. For example, in
  > Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until
  > kernel 4.4.0-51.
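To make the concurrency point above concrete, here is a minimal userspace Python sketch (purely illustrative, no BPF involved, with made-up timestamp values). It deterministically replays the interleaving I described: two CPUs each read the clock and then store their timestamp into a shared "previous timestamp" slot, as your sched_switch probe does with the tsp map. If the CPU that read the later timestamp wins the race and updates the map first, the other CPU's comparison then sees time apparently going backward:

```python
# Deterministic replay of the cross-CPU interleaving (illustrative
# values only; in the real probe these would come from
# bpf_ktime_get_ns() and the tsp map).

def run_probe(ts, state, log):
    """Mimics the probe body: compare against the stored timestamp, then update it."""
    old = state["last"]
    log.append("Correct" if ts > old else "Wrong")
    state["last"] = ts

state = {"last": 0}
log = []

t_a = 23461  # CPU A reads its timestamp first...
t_b = 23467  # ...CPU B reads slightly later, so t_b > t_a.

# CPU B finishes its probe first and stores 23467 into the shared slot.
run_probe(t_b, state, log)
# CPU A finishes afterwards: 23461 < 23467, so the stored "old"
# timestamp looks newer than CPU A's "new" one.
run_probe(t_a, state, log)

print(log)  # -> ['Correct', 'Wrong']
```

No timestamp here ever went backward; only the order of the map updates did. A per-CPU map sidesteps this particular race because each CPU compares only against its own previous timestamp.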
See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d

Cheers,
Alban

On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:
> Hi,
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes bpf_ktime_get_ns() returns an earlier timestamp. Whenever the
> timestamps become out of order, the subsequent events follow the earlier
> timestamp. For example:
>
> Old TSP    New TSP
> 23456      23458
> 23458      23461
> ....       ....
> 23509      23467
> 23467      23470
> 23470      ....
> ....       23527
> 23527      23446
> 23446      ....
>
> I have used a return probe on sys_clone() to store the thread IDs of a
> particular task, and used this information in the sched_switch tracepoint
> to trace the switching events of my task. The task is run on an 8-core
> machine.
>
> What could be the reason for this behaviour? I have included the code, in
> case I have done something wrong.
>
> ######################################################################
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t {
>     u32 threadId;
>     char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32); // Stores threadIds of participating threads
> BPF_HASH(tsp, u32, u64, 1);     // Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/random/urandom_read/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> // Return probe for sys_clone() - stores the thread IDs of a task with command "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>     struct values_t value = {};
>     u32 one = 1, *val, *val1;
>
>     bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>     if (value.comm[0] == 'p' && value.comm[1] == 'r' && value.comm[2] == 'o'
>         && value.comm[3] == 'd') {
>         // Store the parent thread ID
>         value.threadId = bpf_get_current_pid_tgid();
>         val = threadList.lookup_or_init(&value.threadId, &one);
>
>         // Stores other thread IDs
>         struct values_t value1 = {};
>         value1.threadId = PT_REGS_RC(ctx);
>         val1 = threadList.lookup_or_init(&value1.threadId, &one);
>     }
>     return 0;
> }
>
> // Tracepoint probe for the sched_switch tracepoint - stores the previous
> // event timestamp in a map
> TRACEPOINT_PROBE(sched, sched_switch) {
>     u32 *prev, *next; // Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key = 0;
>
>     // Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     // Look up thread ids in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     // If task-threads are involved in context switch
>     if (prev || next) {
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); // Retrieve previous timestamp
>
>         if (oldTS == 0)
>             return 0;
>
>         if (newTS > *oldTS)
>             bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS, newTS);
>         else
>             bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS, newTS);
>
>         tsp.update(&key, &newTS); // Update timestamp
>     }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>     try:
>         (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>         print("%s" % (msg))
>     except ValueError:
>         continue
>
> ################################################################################
>
> The tool cpudist.py also mentions a case where timestamps may be out of
> order.
>
> Many Thanks,
>
> Reena.
>
> _______________________________________________
> iovisor-dev mailing list
> [email protected]
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
