On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
>> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar <mi...@elte.hu> wrote:
>> >> 5/ Mmaped count
>> >>
>> >> It is possible to read counts directly from user space for
>> >> self-monitoring threads. This leverages a HW capability present on
>> >> some processors. On X86, this is possible via RDPMC.
>> >>
>> >> The full 64-bit count is constructed by combining the hardware
>> >> value extracted with an assembly instruction and a base value made
>> >> available thru the mmap. There is an atomic generation count
>> >> available to deal with the race condition.
>> >>
>> >> I believe there is a problem with this approach given that the PMU
>> >> is shared and that events can be multiplexed. That means that even
>> >> though you are self-monitoring, events get replaced on the PMU.
>> >> The assembly instruction is unaware of that: it reads a register,
>> >> not an event.
>> >>
>> >> On x86, assume event A is hosted in counter 0, thus you need
>> >> RDPMC(0) to extract the count. But then, the event is replaced by
>> >> another one which reuses counter 0. At the user level, you will
>> >> still use RDPMC(0) but it will read the HW value from a different
>> >> event and combine it with the base count from another one.
>> >>
>> >> To avoid this, you need to pin the event so it stays in the PMU at
>> >> all times. Now, here is something unclear to me. Pinning does not
>> >> mean staying in the SAME register, it means the event stays on the
>> >> PMU but it can possibly change register. To prevent that, I
>> >> believe you need to also set exclusive so that no other group can
>> >> be scheduled, and thus possibly use the same counter.
>> >>
>> >> Looks like this is the only way you can make this actually work.
>> >> Not setting pinned+exclusive is another pitfall that many
>> >> people will fall into.
>> >
>> > do {
>> >         seq = pc->lock;
>> >
>> >         barrier();
>> >         if (pc->index) {
>> >                 count = pmc_read(pc->index - 1);
>> >                 count += pc->offset;
>> >         } else
>> >                 goto regular_read;
>> >
>> >         barrier();
>> > } while (pc->lock != seq);
>> >
>> > We don't see the hole you are referring to. The sequence lock
>> > ensures you get a consistent view.
>> >
>> Let's take an example, with two groups, one event in each group.
>> Both events scheduled on counter0, i.e., rdpmc(0). The 2 groups
>> are multiplexed, one each tick. The user gets 2 file descriptors
>> and thus two mmap'ed pages.
>>
>> Suppose the user wants to read, using the above loop, the value of the
>> event in the first group BUT it's the 2nd group that is currently active
>> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
>>
>> Unless you tell me that pc->index is marked invalid (0) when the
>> event is not scheduled, I don't see how you can avoid reading
>> the wrong value. I am assuming that if the event is not scheduled,
>> lock remains constant.
>
> Indeed, pc->index == 0 means it's not currently available.
I don't see where you clear that field on x86. Looks like it comes
from hwc->idx. I suspect you need to do something in x86_pmu_disable()
to be symmetrical with x86_pmu_enable(). I suspect something similar
needs to be done on Power.

>
>> Assuming the event is active when you enter the loop and you
>> read a value. How do you get the timing information to scale the
>> count?
>
> I think we would have to add that to the data page... something like
> the below?
>
Yes.

> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
>         __u32   lock;           /* seqlock for synchronization */
>         __u32   index;          /* hardware counter identifier */
>         __s64   offset;         /* add to hardware counter value */
> +       __u64   total_time;     /* total time counter active */
> +       __u64   running_time;   /* time counter on cpu */
> +
> +       __u64   __reserved[123]; /* align at 1k */
>
>         /*
>          * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
>         if (counter->state == PERF_COUNTER_STATE_ACTIVE)
>                 userpg->offset -= atomic64_read(&counter->hw.prev_count);
>
> +       userpg->total_time = counter->total_time_enabled +
> +                       atomic64_read(&counter->child_total_time_enabled);
> +
> +       userpg->running_time = counter->total_time_running +
> +                       atomic64_read(&counter->child_total_time_running);
> +
>         barrier();
>         ++userpg->lock;
>         preempt_enable();
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel