Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Nair, Reena via iovisor-dev Mon, 24 Jul 2017 07:43:33 -0700

Hi,


I am running the program on bare metal, Ubuntu 16.04.2 LTS, kernel version: 
4.8.0-40800-generic. I addressed the following suggestions:


  1.  - The time between your bpf_ktime_get_ns() and your tsp.update() is
not instantaneous.   - Changed
  2.  Replaced BPF_HASH with PERCPU_ARRAY

Still the problem persists...

Reena.







________________________________
From: Alban Crequy <[email protected]>
Sent: 24 July 2017 16:02:51
To: Nair, Reena
Cc: [email protected]
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling 
events

Hi,

I have another issue with timestamps returned by bpf_ktime_get_ns()
but I am not sure I identified the root cause:
https://github.com/weaveworks/scope/issues/2650
[https://avatars1.githubusercontent.com/u/109109?v=4&s=400]<https://github.com/weaveworks/scope/issues/2650>

probes falling back to proc parsing due to late ... - 
GitHub<https://github.com/weaveworks/scope/issues/2650>
github.com
Join GitHub today. GitHub is home to over 20 million developers working 
together to host and review code, manage projects, and build software together.




I have a few questions that might help to understand:

- Does the issue happen when you use a percpu map?
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array
[https://avatars1.githubusercontent.com/u/12766186?v=4&s=400]<https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array>

bcc/reference_guide.md at master · iovisor/bcc · 
GitHub<https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array>
github.com
bcc - BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and 
more




- Do you test on bare metal or on some kind of VM such as on EC2, GCE
or VirtualBox?

My problem does not happen on bare metal, only in virtualized
environments. I noticed that one BPF function can take more than 100ms
in virtualized environments, presumably because the hypervisor
schedule the CPU to another VM. I took two timestamps with
bpf_ktime_get_ns() in the same BPF function and I check the
difference. On bare metal, I have not noticed this kind of variable
time of execution.

- The time between your bpf_ktime_get_ns() and your tsp.update() is
not instantaneous. Another CPU might execute the BPF function
concurrently and finish faster than the first CPU. I have not observed
this kind of issues in practice on bare metal but I have seen that on
GCE.

- I noticed this worrying comment in __ktime_get_fast_ns (code behind
bpf_ktime_get_ns()):
"other CPUs are likely to be able observe [time going backward]"
https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388
[https://avatars1.githubusercontent.com/u/1024025?v=4&s=400]<https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388>

torvalds/linux<https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388>
github.com
linux - Linux kernel source tree



I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There
was a kernel bug that caused timestamps to be wrong:
See https://github.com/weaveworks/scope/issues/2334#issue-214111575:
[https://avatars2.githubusercontent.com/u/507354?v=4&s=400]<https://github.com/weaveworks/scope/issues/2334#issue-214111575:>

ebpf: fall back to proc parsing when we detect a late event · Issue #2334 · 
weaveworks/scope<https://github.com/weaveworks/scope/issues/2334#issue-214111575:>
github.com
There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) 
that makes timestamps be wrong and results in messages like this in scope's 
logs: <probe> ERRO: 2017/03/13 12:33:19.623...



> This bug was introduced in 4.8 but was backported to older kernels by 
> distros. The fix was included in 4.9 and also backported. For example, in 
> Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until 
> kernel 4.4.0-51.
See 
https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef
[https://avatars1.githubusercontent.com/u/1054045?v=4&s=200]<https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef>

timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING · 
torvalds/linux@27727df<https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef>
github.com
When I added some extra sanity checking in timekeeping_get_ns() under 
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns() 
method was using timekeeping_get_ns(). Thus the loc...



and 
https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d
[https://avatars1.githubusercontent.com/u/1054045?v=4&s=200]<https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d>

timekeeping: Fix __ktime_get_fast_ns() regression · 
torvalds/linux@58bfea9<https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d>
github.com
In commit 27727df240c7 ("Avoid taking lock in NMI path with 
CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code the 
timekeeping_get_ns() function, but I forgot to include the unit conversi...




Cheers,
Alban


On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev
<[email protected]> wrote:
> Hi,
>
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes,  bpf_ktime_get_ns() returns a previous time stamp. Whenever the
> time stamps become out of order, the subsequent events follow the  earlier
> timestamp. For example, something like:
>
>
> Old TSP            New TSP
>
>
> 23456               23458
>
> 23458               23461
>
> ....
>
> ....
>
> 23509               23467
>
> 23467               23470
>
> 23470.......
>
> ......                    23527
>
> 23527                23446
>
> 23446                .........
>
>
> I have used a return probe with sys_clone(0) to store the thread Ids of a
> particular task, and used this information in the sched_switch tracepoint to
> trace the switching events of my task. The task is run on an 8-core machine.
>
>
> What could be reason for this behaviour? I have included the code, in case I
> have done something wrong???
>
>
> ######################################################################
>
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
>
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t{
>         u32 threadId;
>         char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32);     //Stores threadIds of participating
> threads
> BPF_HASH(tsp, u32, u64, 1);         //Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/random/urandom_read/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> //Return Probe for sys_clone() - stores the thread Ids of a task with
> command, "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>
>         struct values_t value = {};
>         u32 one = 1, *val, *val1;
>
>         bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>         if(value.comm[0]=='p' && value.comm[1] == 'r' && value.comm[2]=='o'
> && value.comm[3]=='d') {
>
>                 //Store the parent thread ID
>                 value.threadId = bpf_get_current_pid_tgid();
>                 val = threadList.lookup_or_init(&value.threadId, &one);
>
>                 //Stores other thread IDs
>                 struct values_t value1={};
>                 value1.threadId = PT_REGS_RC(ctx);
>                 val1 = threadList.lookup_or_init(&value1.threadId, &one);
>         }
>         return 0;
> }
>
>
> //Tracepoint probe for the Sched_Switch tracepoint - stores the previous
> event timestamp in a map
>
> TRACEPOINT_PROBE(sched, sched_switch){
>
>     u32 *prev, *next;  //Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key=0;
>
>     //Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     //Look up thread ids in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     //If task-threads are involved in Context Switch
>     if( prev || next){
>
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous
> timestamp
>
>         if(oldTS==0)
>                 return 0;
>
>         if(newTS > *oldTS)
>                 bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS,
> newTS);
>         else
>                 bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS,
> newTS);
>
>         tsp.update(&key, &newTS); //Update timestamp
>      }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>    try:
>       (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>       print("%s" % (msg))
>
>    except ValueError:
>         continue
>
>
> ################################################################################
>
>
> The tool cpudist.py also mentions about a case when timestamps may be out of
> order.
>
>
> Many Thanks ,
>
> Reena.
>
>
>
>
>
> _______________________________________________
> iovisor-dev mailing list
> [email protected]
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
iovisor-dev Info Page<https://lists.iovisor.org/mailman/listinfo/iovisor-dev>
lists.iovisor.org
Discussion list for IOVisor development. To see the collection of prior 
postings to the list, visit the iovisor-dev Archives. Using iovisor-dev



>

_______________________________________________
iovisor-dev mailing list
[email protected]
https://lists.iovisor.org/mailman/listinfo/iovisor-dev

Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Reply via email to