Hi,
I am running the program on bare metal, Ubuntu 16.04.2 LTS, kernel version: 4.8.0-40800-generic. I addressed the following suggestions: 1. - The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous. - Changed 2. Replaced BPF_HASH with PERCPU_ARRAY Still the problem persists... Reena. ________________________________ From: Alban Crequy <[email protected]> Sent: 24 July 2017 16:02:51 To: Nair, Reena Cc: [email protected] Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events Hi, I have another issue with timestamps returned by bpf_ktime_get_ns() but I am not sure I identified the root cause: https://github.com/weaveworks/scope/issues/2650 [https://avatars1.githubusercontent.com/u/109109?v=4&s=400]<https://github.com/weaveworks/scope/issues/2650> probes falling back to proc parsing due to late ... - GitHub<https://github.com/weaveworks/scope/issues/2650> github.com Join GitHub today. GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together. I have a few questions that might help to understand: - Does the issue happen when you use a percpu map? https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array [https://avatars1.githubusercontent.com/u/12766186?v=4&s=400]<https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array> bcc/reference_guide.md at master · iovisor/bcc · GitHub<https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array> github.com bcc - BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more - Do you test on bare metal or on some kind of VM such as on EC2, GCE or VirtualBox? My problem does not happen on bare metal, only in virtualized environments. I noticed that one BPF function can take more than 100ms in virtualized environments, presumably because the hypervisor schedule the CPU to another VM. I took two timestamps with bpf_ktime_get_ns() in the same BPF function and I check the difference. On bare metal, I have not noticed this kind of variable time of execution. - The time between your bpf_ktime_get_ns() and your tsp.update() is not instantaneous. Another CPU might execute the BPF function concurrently and finish faster than the first CPU. I have not observed this kind of issues in practice on bare metal but I have seen that on GCE. - I noticed this worrying comment in __ktime_get_fast_ns (code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]" https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388 [https://avatars1.githubusercontent.com/u/1024025?v=4&s=400]<https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388> torvalds/linux<https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388> github.com linux - Linux kernel source tree I don't understand the full details. - Which kernel version and which Linux distribution do you use? There was a kernel bug that caused timestamps to be wrong: See https://github.com/weaveworks/scope/issues/2334#issue-214111575: [https://avatars2.githubusercontent.com/u/507354?v=4&s=400]<https://github.com/weaveworks/scope/issues/2334#issue-214111575:> ebpf: fall back to proc parsing when we detect a late event · Issue #2334 · weaveworks/scope<https://github.com/weaveworks/scope/issues/2334#issue-214111575:> github.com There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) that makes timestamps be wrong and results in messages like this in scope's logs: <probe> ERRO: 2017/03/13 12:33:19.623... > This bug was introduced in 4.8 but was backported to older kernels by > distros. The fix was included in 4.9 and also backported. For example, in > Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until > kernel 4.4.0-51. See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef [https://avatars1.githubusercontent.com/u/1054045?v=4&s=200]<https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef> timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING · torvalds/linux@27727df<https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef> github.com When I added some extra sanity checking in timekeeping_get_ns() under CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns() method was using timekeeping_get_ns(). Thus the loc... and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d [https://avatars1.githubusercontent.com/u/1054045?v=4&s=200]<https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d> timekeeping: Fix __ktime_get_fast_ns() regression · torvalds/linux@58bfea9<https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d> github.com In commit 27727df240c7 ("Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code the timekeeping_get_ns() function, but I forgot to include the unit conversi... Cheers, Alban On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev <[email protected]> wrote: > Hi, > > > While tracing the scheduling events of a particular task, I noticed that > sometimes, bpf_ktime_get_ns() returns a previous time stamp. Whenever the > time stamps become out of order, the subsequent events follow the earlier > timestamp. For example, something like: > > > Old TSP New TSP > > > 23456 23458 > > 23458 23461 > > .... > > .... > > 23509 23467 > > 23467 23470 > > 23470....... > > ...... 23527 > > 23527 23446 > > 23446 ......... > > > I have used a return probe with sys_clone(0) to store the thread Ids of a > particular task, and used this information in the sched_switch tracepoint to > trace the switching events of my task. The task is run on an 8-core machine. > > > What could be reason for this behaviour? I have included the code, in case I > have done something wrong??? > > > ###################################################################### > > > from bcc import BPF > > # define BPF program > bpf_text = """ > > #include <uapi/linux/ptrace.h> > #include <linux/sched.h> > #include <linux/types.h> > > struct values_t{ > u32 threadId; > char comm[TASK_COMM_LEN]; > }; > > BPF_HASH(threadList, u32, u32); //Stores threadIds of participating > threads > BPF_HASH(tsp, u32, u64, 1); //Stores timestamp of previous event > > /*struct sched_switch_args { > // from /sys/kernel/debug/tracing/events/random/urandom_read/format > u64 __unused__; > char prev_comm[16]; > pid_t prev_pid; > int prev_prio; > long prev_state; > char next_comm[16]; > pid_t next_pid; > int next_prio; > };*/ > > //Return Probe for sys_clone() - stores the thread Ids of a task with > command, "prod" > int trace_sys_clone(struct pt_regs *ctx) { > > struct values_t value = {}; > u32 one = 1, *val, *val1; > > bpf_get_current_comm(&value.comm, sizeof(value.comm)); > > if(value.comm[0]=='p' && value.comm[1] == 'r' && value.comm[2]=='o' > && value.comm[3]=='d') { > > //Store the parent thread ID > value.threadId = bpf_get_current_pid_tgid(); > val = threadList.lookup_or_init(&value.threadId, &one); > > //Stores other thread IDs > struct values_t value1={}; > value1.threadId = PT_REGS_RC(ctx); > val1 = threadList.lookup_or_init(&value1.threadId, &one); > } > return 0; > } > > > //Tracepoint probe for the Sched_Switch tracepoint - stores the previous > event timestamp in a map > > TRACEPOINT_PROBE(sched, sched_switch){ > > u32 *prev, *next; //Pointers to entries in threadList map > u32 next_pid, prev_pid, key=0; > > //Copy data to BPF stack > bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid); > bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid); > > //Look up thread ids in the list created by sys_clone() > next = threadList.lookup(&next_pid); > prev = threadList.lookup(&prev_pid); > > //If task-threads are involved in Context Switch > if( prev || next){ > > u64 newTS = bpf_ktime_get_ns(); > u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous > timestamp > > if(oldTS==0) > return 0; > > if(newTS > *oldTS) > bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS, > newTS); > else > bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS, > newTS); > > tsp.update(&key, &newTS); //Update timestamp > } > return 0; > } > """ > > # load BPF program > b = BPF(text=bpf_text) > b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone") > > # format output > while 1: > try: > (task, pid, cpu, flags, ts, msg) = b.trace_fields() > print("%s" % (msg)) > > except ValueError: > continue > > > ################################################################################ > > > The tool cpudist.py also mentions about a case when timestamps may be out of > order. > > > Many Thanks , > > Reena. > > > > > > _______________________________________________ > iovisor-dev mailing list > [email protected] > https://lists.iovisor.org/mailman/listinfo/iovisor-dev iovisor-dev Info Page<https://lists.iovisor.org/mailman/listinfo/iovisor-dev> lists.iovisor.org Discussion list for IOVisor development. To see the collection of prior postings to the list, visit the iovisor-dev Archives. Using iovisor-dev >
_______________________________________________ iovisor-dev mailing list [email protected] https://lists.iovisor.org/mailman/listinfo/iovisor-dev
