Hi,
The clock source for bpf_ktime_get_ns() is CLOCK_MONOTONIC, which should never go backwards, I presume. So even if the timestamps are recorded out of order, maybe because of simultaneous access across CPUs, eventually they should increase monotonically. Is bpf_ktime_get_ns() using CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW? If it is the former, could this be due to NTP updates, as suggested by Y Song?

As Alban mentioned, it would help if somebody could clarify the comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]" https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388; maybe that would explain the reason behind this.

Reena

________________________________
From: Y Song <[email protected]>
Sent: 25 July 2017 01:19:45
To: Nair, Reena
Cc: Alban Crequy; [email protected]
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Just a wild guess. Could it be that subsequent bpf invocations are not on the same cpu? The previous invocation reads the timestamp on one cpu, and the next invocation gets the timestamp on a different cpu. If the two cpus are not sync'ing perfectly, you may get an issue. If there is an update (e.g., NTP), different cpus may have a different pace of adjustment; could this contribute to the temporary timestamp divergence?

On Mon, Jul 24, 2017 at 7:43 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:

Hi,

I am running the program on bare metal, Ubuntu 16.04.2 LTS, kernel version: 4.8.0-40800-generic.

I addressed the following suggestions:

1. "The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous." - Changed
2. Replaced BPF_HASH with PERCPU_ARRAY

Still the problem persists...

Reena.
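From userspace, the two clocks in question can be sampled with Python's time.clock_gettime_ns() (Python 3.7+, Linux only). This is a rough sketch for comparing CLOCK_MONOTONIC (which NTP slews) with CLOCK_MONOTONIC_RAW (which it does not); note this syscall path is not the NMI-safe fast path that bpf_ktime_get_ns() uses in the kernel:

```python
# Sketch: sample the two clocks discussed above from userspace.
# CLOCK_MONOTONIC is subject to NTP frequency slewing;
# CLOCK_MONOTONIC_RAW is not. Linux-only, Python 3.7+.
import time

def sample(clock_id, n=5):
    """Take n consecutive readings of the given clock, in nanoseconds."""
    return [time.clock_gettime_ns(clock_id) for _ in range(n)]

mono = sample(time.CLOCK_MONOTONIC)
raw = sample(time.CLOCK_MONOTONIC_RAW)

# Within a single thread, consecutive reads of either clock never go backwards.
assert all(a <= b for a, b in zip(mono, mono[1:]))
assert all(a <= b for a, b in zip(raw, raw[1:]))

print("CLOCK_MONOTONIC     :", mono)
print("CLOCK_MONOTONIC_RAW :", raw)
```

Within one thread both sequences are monotonic; the question in this thread is what happens when readings from different CPUs are compared.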
________________________________
From: Alban Crequy <[email protected]>
Sent: 24 July 2017 16:02:51
To: Nair, Reena
Cc: [email protected]
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Hi,

I have another issue with timestamps returned by bpf_ktime_get_ns(), but I am not sure I identified the root cause: https://github.com/weaveworks/scope/issues/2650

I have a few questions that might help to understand:

- Does the issue happen when you use a percpu map? https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array

- Do you test on bare metal or on some kind of VM such as on EC2, GCE or VirtualBox? My problem does not happen on bare metal, only in virtualized environments. I noticed that one BPF function can take more than 100ms in virtualized environments, presumably because the hypervisor schedules the CPU to another VM. I took two timestamps with bpf_ktime_get_ns() in the same BPF function and checked the difference. On bare metal, I have not noticed this kind of variable execution time.
- The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous. Another CPU might execute the BPF function concurrently and finish faster than the first CPU. I have not observed this kind of issue in practice on bare metal, but I have seen it on GCE.

- I noticed this worrying comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]" https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388 I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There was a kernel bug that caused timestamps to be wrong; see https://github.com/weaveworks/scope/issues/2334#issue-214111575:

> There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) that makes timestamps be wrong and results in messages like this in scope's logs: <probe> ERRO: 2017/03/13 12:33:19.623...
> This bug was introduced in 4.8 but was backported to older kernels by
> distros. The fix was included in 4.9 and also backported. For example, in
> Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until
> kernel 4.4.0-51.
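Alban's first point (the clock read and the map update are not atomic) can reproduce the symptom even with a perfectly monotonic clock: the slower CPU commits an older timestamp after a newer one. A deterministic userspace sketch of that interleaving (plain Python; the timestamps are made up to match the example trace earlier in the thread):

```python
# Sketch of the race described above: each CPU does
#   ts = bpf_ktime_get_ns(); ...; tsp.update(&key, &ts)
# but a slower CPU can commit an OLDER ts after a faster CPU committed
# a newer one. The clock never goes backwards; the stored "previous"
# value does. Timestamps are illustrative only.

def run_interleaving(events):
    """events: list of (ts_read, commit_order). Replay map updates in
    commit order and report the comparisons the BPF program would print."""
    log = []
    last = None
    for ts, _ in sorted(events, key=lambda e: e[1]):
        if last is not None:
            log.append(("Correct" if ts > last else "Wrong", last, ts))
        last = ts
    return log

# ts=23467 was read BEFORE ts=23509, but its update commits AFTER it.
log = run_interleaving([(23456, 0), (23458, 1), (23509, 2),
                        (23467, 3), (23470, 4)])
for verdict, old, new in log:
    print("%s: Old: %d New: %d" % (verdict, old, new))
```

The third line of output is a "Wrong" step from 23509 back to 23467, the same shape as the trace Reena reported, without any clock anomaly at all.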
See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef ("timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING") and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d ("timekeeping: Fix __ktime_get_fast_ns() regression").

Cheers,
Alban

On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:
> Hi,
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes bpf_ktime_get_ns() returns a previous timestamp. Whenever the
> timestamps become out of order, the subsequent events follow the earlier
> timestamp. For example, something like:
>
>     Old TSP    New TSP
>     23456      23458
>     23458      23461
>     ....       ....
>     23509      23467
>     23467      23470
>     23470      .......
>     ......     23527
>     23527      23446
>     23446      .........
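Based only on the Ubuntu version range quoted above (bug introduced in 4.4.0-42, fixed in 4.4.0-51), a quick sketch for checking whether a 4.4-series Ubuntu kernel falls in the affected range; it deliberately handles nothing outside that one range:

```python
# Rough check against the Ubuntu range quoted in the issue:
# affected from 4.4.0-42 (inclusive) until 4.4.0-51 (exclusive).
# Any other version string returns False; this says nothing about
# other series (e.g. mainline 4.8, which had the bug until 4.9).
import re

def ubuntu_44_affected(release):
    """True if `release` (e.g. "4.4.0-45-generic") has ABI in [42, 51)."""
    m = re.match(r"4\.4\.0-(\d+)", release)
    if not m:
        return False
    abi = int(m.group(1))
    return 42 <= abi < 51

assert ubuntu_44_affected("4.4.0-45-generic")
assert not ubuntu_44_affected("4.4.0-51-generic")
assert not ubuntu_44_affected("4.8.0-40800-generic")
```

Note that Reena's kernel (4.8.0-40800-generic) is in the 4.8 series, where the mainline bug existed until the 4.9 fix was backported, so it is worth checking that series separately.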
> I have used a return probe on sys_clone() to store the thread IDs of a
> particular task, and used this information in the sched_switch tracepoint to
> trace the switching events of my task. The task is run on an 8-core machine.
>
> What could be the reason for this behaviour? I have included the code, in
> case I have done something wrong.
>
> ######################################################################
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
>
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t {
>     u32 threadId;
>     char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32); //Stores thread IDs of participating threads
> BPF_HASH(tsp, u32, u64, 1);     //Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/sched/sched_switch/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> //Return probe for sys_clone() - stores the thread IDs of a task with
> //command "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>
>     struct values_t value = {};
>     u32 one = 1, *val, *val1;
>
>     bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>     if (value.comm[0]=='p' && value.comm[1]=='r' && value.comm[2]=='o'
>         && value.comm[3]=='d') {
>
>         //Store the parent thread ID
>         value.threadId = bpf_get_current_pid_tgid();
>         val = threadList.lookup_or_init(&value.threadId, &one);
>
>         //Store the other thread IDs
>         struct values_t value1 = {};
>         value1.threadId = PT_REGS_RC(ctx);
>         val1 = threadList.lookup_or_init(&value1.threadId, &one);
>     }
>     return 0;
> }
>
> //Tracepoint probe for the sched_switch tracepoint - stores the previous
> //event timestamp in a map
> TRACEPOINT_PROBE(sched, sched_switch) {
>
>     u32 *prev, *next; //Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key = 0;
>
>     //Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     //Look up thread IDs in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     //If task threads are involved in the context switch
>     if (prev || next) {
>
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous
>                                                        //timestamp
>         if (oldTS == 0)
>             return 0;
>
>         if (newTS > *oldTS)
>             bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS,
>                              newTS);
>         else
>             bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS,
>                              newTS);
>
>         tsp.update(&key, &newTS); //Update timestamp
>     }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>     try:
>         (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>         print("%s" % (msg))
>     except ValueError:
>         continue
>
> ################################################################################
>
> The tool cpudist.py also mentions a case where timestamps may be out of
> order.
>
> Many thanks,
> Reena.
>
> _______________________________________________
> iovisor-dev mailing list
> [email protected]
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
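The comparison that the BPF program prints can also be reproduced offline over a collected trace. A small userspace sketch (plain Python; the sample timestamps are made up to mirror the example earlier in the thread) that flags every apparent backward jump:

```python
# Userspace version of the check in the BPF program above: scan a list of
# (timestamp_ns, event) pairs in arrival order and flag any event whose
# timestamp is not newer than the previous one. Sample data is illustrative.
def find_backward_steps(trace):
    """Return [(prev_ts, ts)] for every apparent backward jump."""
    backward = []
    for (prev_ts, _), (ts, _) in zip(trace, trace[1:]):
        if ts <= prev_ts:
            backward.append((prev_ts, ts))
    return backward

trace = [(23456, "switch"), (23458, "switch"),
         (23509, "switch"), (23467, "switch"), (23470, "switch")]
print(find_backward_steps(trace))  # [(23509, 23467)]
```

Running this kind of check in userspace avoids one confounder: the bpf_trace_printk() output itself can be interleaved across CPUs, so analyzing raw timestamps after the fact separates real clock anomalies from ordering races in the map update.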
_______________________________________________
iovisor-dev mailing list
[email protected]
https://lists.iovisor.org/mailman/listinfo/iovisor-dev
