Hi, I have another issue with timestamps returned by bpf_ktime_get_ns(), but I am not sure I have identified the root cause: https://github.com/weaveworks/scope/issues/2650
I have a few questions that might help to understand:

- Does the issue happen when you use a per-CPU map? https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array

- Do you test on bare metal or on some kind of VM, such as EC2, GCE or VirtualBox? My problem does not happen on bare metal, only in virtualized environments. I noticed that one BPF function can take more than 100ms in virtualized environments, presumably because the hypervisor schedules the CPU to another VM. I took two timestamps with bpf_ktime_get_ns() in the same BPF function and checked the difference. On bare metal, I have not noticed this kind of variable execution time.

- The time between your bpf_ktime_get_ns() call and your tsp.update() call is not zero. Another CPU might execute the BPF function concurrently and finish faster than the first CPU, so the later timestamp can be stored first. I have not observed this kind of issue in practice on bare metal, but I have seen it on GCE.

- I noticed this worrying comment in __ktime_get_fast_ns() (the code behind bpf_ktime_get_ns()): "other CPUs are likely to be able observe [time going backward]". See https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388. I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There was a kernel bug that caused timestamps to be wrong; see https://github.com/weaveworks/scope/issues/2334#issue-214111575:

  > This bug was introduced in 4.8 but was backported to older kernels by
  > distros. The fix was included in 4.9 and also backported. For example, in
  > Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until
  > kernel 4.4.0-51.
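To make the concurrency point above concrete, here is a minimal userspace Python sketch (purely illustrative, no BPF involved, with made-up timestamp values). It deterministically replays the interleaving I described: two CPUs each read the clock and then store their timestamp into a shared "previous timestamp" slot, as your sched_switch probe does with the tsp map. If the CPU that read the later timestamp wins the race and updates the map first, the other CPU's comparison then sees time apparently going backward:

```python
# Deterministic replay of the cross-CPU interleaving (illustrative
# values only; in the real probe these would come from
# bpf_ktime_get_ns() and the tsp map).

def run_probe(ts, state, log):
    """Mimics the probe body: compare against the stored timestamp, then update it."""
    old = state["last"]
    log.append("Correct" if ts > old else "Wrong")
    state["last"] = ts

state = {"last": 0}
log = []

t_a = 23461  # CPU A reads its timestamp first...
t_b = 23467  # ...CPU B reads slightly later, so t_b > t_a.

# CPU B finishes its probe first and stores 23467 into the shared slot.
run_probe(t_b, state, log)
# CPU A finishes afterwards: 23461 < 23467, so the stored "old"
# timestamp looks newer than CPU A's "new" one.
run_probe(t_a, state, log)

print(log)  # -> ['Correct', 'Wrong']
```

No timestamp here ever went backward; only the order of the map updates did. A per-CPU map sidesteps this particular race because each CPU compares only against its own previous timestamp.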
See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d

Cheers,
Alban

On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev <[email protected]> wrote:
> Hi,
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes bpf_ktime_get_ns() returns an earlier timestamp. Whenever the
> timestamps become out of order, the subsequent events follow the earlier
> timestamp. For example:
>
> Old TSP    New TSP
> 23456      23458
> 23458      23461
> ....       ....
> 23509      23467
> 23467      23470
> 23470      ....
> ....       23527
> 23527      23446
> 23446      ....
>
> I have used a return probe on sys_clone() to store the thread IDs of a
> particular task, and used this information in the sched_switch tracepoint
> to trace the switching events of my task. The task is run on an 8-core
> machine.
>
> What could be the reason for this behaviour? I have included the code, in
> case I have done something wrong.
>
> ######################################################################
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t {
>     u32 threadId;
>     char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32); // Stores threadIds of participating threads
> BPF_HASH(tsp, u32, u64, 1);     // Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/random/urandom_read/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> // Return probe for sys_clone() - stores the thread IDs of a task with command "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>     struct values_t value = {};
>     u32 one = 1, *val, *val1;
>
>     bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>     if (value.comm[0] == 'p' && value.comm[1] == 'r' && value.comm[2] == 'o'
>         && value.comm[3] == 'd') {
>         // Store the parent thread ID
>         value.threadId = bpf_get_current_pid_tgid();
>         val = threadList.lookup_or_init(&value.threadId, &one);
>
>         // Stores other thread IDs
>         struct values_t value1 = {};
>         value1.threadId = PT_REGS_RC(ctx);
>         val1 = threadList.lookup_or_init(&value1.threadId, &one);
>     }
>     return 0;
> }
>
> // Tracepoint probe for the sched_switch tracepoint - stores the previous
> // event timestamp in a map
> TRACEPOINT_PROBE(sched, sched_switch) {
>     u32 *prev, *next; // Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key = 0;
>
>     // Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     // Look up thread ids in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     // If task-threads are involved in context switch
>     if (prev || next) {
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); // Retrieve previous timestamp
>
>         if (oldTS == 0)
>             return 0;
>
>         if (newTS > *oldTS)
>             bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS, newTS);
>         else
>             bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS, newTS);
>
>         tsp.update(&key, &newTS); // Update timestamp
>     }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>     try:
>         (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>         print("%s" % (msg))
>     except ValueError:
>         continue
>
> ################################################################################
>
> The tool cpudist.py also mentions a case where timestamps may be out of
> order.
>
> Many Thanks,
>
> Reena.
>
> _______________________________________________
> iovisor-dev mailing list
> [email protected]
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
