Hi,
The clock source for bpf_ktime_get_ns() is CLOCK_MONOTONIC, which should never go backwards, I presume. So even if the timestamps are recorded out of order, maybe because of simultaneous access across CPUs, eventually they should increase monotonically. Is bpf_ktime_get_ns() using CLOCK_MONOTONIC or CLOCK_MONOTONIC_RAW? If it is the former, could this be due to NTP updates, as suggested by Y Song?

As Alban mentioned, it would help if somebody could clarify the comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]" https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388; maybe that would explain the reason behind this.

Reena

________________________________
From: Y Song <[email protected]>
Sent: 25 July 2017 01:19:45
To: Nair, Reena
Cc: Alban Crequy; [email protected]
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Just a wild guess. Could it be that subsequent bpf invocations are not on the same cpu? The previous invocation reads the timestamp on one cpu, and the next invocation gets the timestamp on a different cpu. If the two cpus are not sync'ing perfectly, you may get an issue. If there is an update (e.g., NTP), different cpus may have a different pace of adjustment; could this contribute to the temporary timestamp divergence?

On Mon, Jul 24, 2017 at 7:43 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:

Hi,

I am running the program on bare metal, Ubuntu 16.04.2 LTS, kernel version: 4.8.0-40800-generic.

I addressed the following suggestions:

1. "The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous." - Changed
2. Replaced BPF_HASH with PERCPU_ARRAY

Still the problem persists...

Reena.
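From userspace, the two clocks in question can be sampled with Python's time.clock_gettime_ns() (Python 3.7+, Linux only). This is a rough sketch for comparing CLOCK_MONOTONIC (which NTP slews) with CLOCK_MONOTONIC_RAW (which it does not); note this syscall path is not the NMI-safe fast path that bpf_ktime_get_ns() uses in the kernel:

```python
# Sketch: sample the two clocks discussed above from userspace.
# CLOCK_MONOTONIC is subject to NTP frequency slewing;
# CLOCK_MONOTONIC_RAW is not. Linux-only, Python 3.7+.
import time

def sample(clock_id, n=5):
    """Take n consecutive readings of the given clock, in nanoseconds."""
    return [time.clock_gettime_ns(clock_id) for _ in range(n)]

mono = sample(time.CLOCK_MONOTONIC)
raw = sample(time.CLOCK_MONOTONIC_RAW)

# Within a single thread, consecutive reads of either clock never go backwards.
assert all(a <= b for a, b in zip(mono, mono[1:]))
assert all(a <= b for a, b in zip(raw, raw[1:]))

print("CLOCK_MONOTONIC     :", mono)
print("CLOCK_MONOTONIC_RAW :", raw)
```

Within one thread both sequences are monotonic; the question in this thread is what happens when readings from different CPUs are compared.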
________________________________
From: Alban Crequy <[email protected]>
Sent: 24 July 2017 16:02:51
To: Nair, Reena
Cc: [email protected]
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events

Hi,

I have another issue with timestamps returned by bpf_ktime_get_ns(), but I am not sure I identified the root cause: https://github.com/weaveworks/scope/issues/2650

I have a few questions that might help to understand:

- Does the issue happen when you use a percpu map? https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array

- Do you test on bare metal or on some kind of VM such as on EC2, GCE or VirtualBox? My problem does not happen on bare metal, only in virtualized environments. I noticed that one BPF function can take more than 100ms in virtualized environments, presumably because the hypervisor schedules the CPU to another VM. I took two timestamps with bpf_ktime_get_ns() in the same BPF function and checked the difference. On bare metal, I have not noticed this kind of variable execution time.
- The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous. Another CPU might execute the BPF function concurrently and finish faster than the first CPU. I have not observed this kind of issue in practice on bare metal, but I have seen it on GCE.

- I noticed this worrying comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]" https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388 I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There was a kernel bug that caused timestamps to be wrong; see https://github.com/weaveworks/scope/issues/2334#issue-214111575:

> There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) that makes timestamps be wrong and results in messages like this in scope's logs: <probe> ERRO: 2017/03/13 12:33:19.623...
> This bug was introduced in 4.8 but was backported to older kernels by
> distros. The fix was included in 4.9 and also backported. For example, in
> Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until
> kernel 4.4.0-51.
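Alban's first point (the clock read and the map update are not atomic) can reproduce the symptom even with a perfectly monotonic clock: the slower CPU commits an older timestamp after a newer one. A deterministic userspace sketch of that interleaving (plain Python; the timestamps are made up to match the example trace earlier in the thread):

```python
# Sketch of the race described above: each CPU does
#   ts = bpf_ktime_get_ns(); ...; tsp.update(&key, &ts)
# but a slower CPU can commit an OLDER ts after a faster CPU committed
# a newer one. The clock never goes backwards; the stored "previous"
# value does. Timestamps are illustrative only.

def run_interleaving(events):
    """events: list of (ts_read, commit_order). Replay map updates in
    commit order and report the comparisons the BPF program would print."""
    log = []
    last = None
    for ts, _ in sorted(events, key=lambda e: e[1]):
        if last is not None:
            log.append(("Correct" if ts > last else "Wrong", last, ts))
        last = ts
    return log

# ts=23467 was read BEFORE ts=23509, but its update commits AFTER it.
log = run_interleaving([(23456, 0), (23458, 1), (23509, 2),
                        (23467, 3), (23470, 4)])
for verdict, old, new in log:
    print("%s: Old: %d New: %d" % (verdict, old, new))
```

The third line of output is a "Wrong" step from 23509 back to 23467, the same shape as the trace Reena reported, without any clock anomaly at all.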
See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef ("timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING") and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d ("timekeeping: Fix __ktime_get_fast_ns() regression").

Cheers,
Alban

On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:
> Hi,
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes bpf_ktime_get_ns() returns a previous timestamp. Whenever the
> timestamps become out of order, the subsequent events follow the earlier
> timestamp. For example, something like:
>
>     Old TSP    New TSP
>     23456      23458
>     23458      23461
>     ....       ....
>     23509      23467
>     23467      23470
>     23470      .......
>     ......     23527
>     23527      23446
>     23446      .........
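Based only on the Ubuntu version range quoted above (bug introduced in 4.4.0-42, fixed in 4.4.0-51), a quick sketch for checking whether a 4.4-series Ubuntu kernel falls in the affected range; it deliberately handles nothing outside that one range:

```python
# Rough check against the Ubuntu range quoted in the issue:
# affected from 4.4.0-42 (inclusive) until 4.4.0-51 (exclusive).
# Any other version string returns False; this says nothing about
# other series (e.g. mainline 4.8, which had the bug until 4.9).
import re

def ubuntu_44_affected(release):
    """True if `release` (e.g. "4.4.0-45-generic") has ABI in [42, 51)."""
    m = re.match(r"4\.4\.0-(\d+)", release)
    if not m:
        return False
    abi = int(m.group(1))
    return 42 <= abi < 51

assert ubuntu_44_affected("4.4.0-45-generic")
assert not ubuntu_44_affected("4.4.0-51-generic")
assert not ubuntu_44_affected("4.8.0-40800-generic")
```

Note that Reena's kernel (4.8.0-40800-generic) is in the 4.8 series, where the mainline bug existed until the 4.9 fix was backported, so it is worth checking that series separately.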
> I have used a return probe on sys_clone() to store the thread IDs of a
> particular task, and used this information in the sched_switch tracepoint to
> trace the switching events of my task. The task is run on an 8-core machine.
>
> What could be the reason for this behaviour? I have included the code, in
> case I have done something wrong.
>
> ######################################################################
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
>
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t {
>     u32 threadId;
>     char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32); //Stores thread IDs of participating threads
> BPF_HASH(tsp, u32, u64, 1);     //Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/sched/sched_switch/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> //Return probe for sys_clone() - stores the thread IDs of a task with
> //command "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>
>     struct values_t value = {};
>     u32 one = 1, *val, *val1;
>
>     bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>     if (value.comm[0]=='p' && value.comm[1]=='r' && value.comm[2]=='o'
>         && value.comm[3]=='d') {
>
>         //Store the parent thread ID
>         value.threadId = bpf_get_current_pid_tgid();
>         val = threadList.lookup_or_init(&value.threadId, &one);
>
>         //Store the other thread IDs
>         struct values_t value1 = {};
>         value1.threadId = PT_REGS_RC(ctx);
>         val1 = threadList.lookup_or_init(&value1.threadId, &one);
>     }
>     return 0;
> }
>
> //Tracepoint probe for the sched_switch tracepoint - stores the previous
> //event timestamp in a map
> TRACEPOINT_PROBE(sched, sched_switch) {
>
>     u32 *prev, *next; //Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key = 0;
>
>     //Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     //Look up thread IDs in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     //If task threads are involved in the context switch
>     if (prev || next) {
>
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous
>                                                        //timestamp
>         if (oldTS == 0)
>             return 0;
>
>         if (newTS > *oldTS)
>             bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS,
>                              newTS);
>         else
>             bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS,
>                              newTS);
>
>         tsp.update(&key, &newTS); //Update timestamp
>     }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>     try:
>         (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>         print("%s" % (msg))
>     except ValueError:
>         continue
>
> ################################################################################
>
> The tool cpudist.py also mentions a case where timestamps may be out of
> order.
>
> Many thanks,
> Reena.
>
> _______________________________________________
> iovisor-dev mailing list
> [email protected]
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
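The comparison that the BPF program prints can also be reproduced offline over a collected trace. A small userspace sketch (plain Python; the sample timestamps are made up to mirror the example earlier in the thread) that flags every apparent backward jump:

```python
# Userspace version of the check in the BPF program above: scan a list of
# (timestamp_ns, event) pairs in arrival order and flag any event whose
# timestamp is not newer than the previous one. Sample data is illustrative.
def find_backward_steps(trace):
    """Return [(prev_ts, ts)] for every apparent backward jump."""
    backward = []
    for (prev_ts, _), (ts, _) in zip(trace, trace[1:]):
        if ts <= prev_ts:
            backward.append((prev_ts, ts))
    return backward

trace = [(23456, "switch"), (23458, "switch"),
         (23509, "switch"), (23467, "switch"), (23470, "switch")]
print(find_backward_steps(trace))  # [(23509, 23467)]
```

Running this kind of check in userspace avoids one confounder: the bpf_trace_printk() output itself can be interleaved across CPUs, so analyzing raw timestamps after the fact separates real clock anomalies from ordering races in the map update.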
_______________________________________________
iovisor-dev mailing list
[email protected]
https://lists.iovisor.org/mailman/listinfo/iovisor-dev
