On 01/09/2011 6:45 AM, stephane eranian wrote:
> On Thu, Sep 1, 2011 at 3:29 PM, stephane eranian
> <[email protected]> wrote:
>> On Thu, Sep 1, 2011 at 3:06 PM, Ryan Johnson
>> <[email protected]> wrote:
>>> On 01/09/2011 1:55 AM, stephane eranian wrote:
>>>> On Thu, Sep 1, 2011 at 1:07 AM, Corey Ashford
>>>> <[email protected]> wrote:
>>>>> On 08/25/2011 07:19 AM, stephane eranian wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Sorry for late reply.
>>>>>>
>>>>>> The current support for mmaped count is broken on perf_event x86.
>>>>>> It simply does not work. I think it only works on PPC at this point.
>>>>> Just as an aside, you can access the counter registers from user space
>>>>> on Power (aka PPC) machines, but because the kernel is free to schedule
>>>>> the events onto whatever counters that meet the resource constraints,
>>>>> it's not at all clear which hardware counter to read from user space,
>>>>> and in fact, with event rotation, the counter being used can change from
>>>>> one system tick to the next.
>>>>>
>>>>> If you program a single event, you can be guaranteed that it won't move
>>>>> around, but you still will have to guess or somehow determine which
>>>>> hardware counter is being used by the kernel.
>>>>>
>>>> Yes, and that's why they have this 'lock' field in there. It's not
>>>> really a lock but rather a generation counter. You need to read it
>>>> before you attempt to read the count, and check it again when you're
>>>> done reading. If the two values don't match, the counter changed and
>>>> you need to retry; a change means the event may have moved to a
>>>> different hardware counter.
>>> This protocol is actually documented pretty well in
>>> <linux/perf_event.h>, too. Read the lock, read the index, read hw
>>> counter[index-1], read lock again to verify.
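A minimal sketch of that retry loop, for illustration only: the struct
below abbreviates the relevant fields of struct perf_event_mmap_page
rather than reproducing its real layout, and the rdpmc instruction is
abstracted behind a function pointer, so this is not a drop-in
implementation.

```c
#include <stdint.h>

/* Hypothetical mirror of the three fields the protocol touches; the
 * real layout lives in <linux/perf_event.h>. */
struct mmap_page {
    uint32_t lock;    /* generation counter, bumped by the kernel */
    uint32_t index;   /* hw counter index + 1, or 0 if unavailable */
    int64_t  offset;  /* sw-maintained base to add to the raw count */
};

/* rdpmc_fn stands in for the actual rdpmc instruction so the retry
 * loop itself can be exercised without PMU access. */
static uint64_t read_count(volatile struct mmap_page *pc,
                           uint64_t (*rdpmc_fn)(uint32_t))
{
    uint32_t seq, idx;
    uint64_t count;

    do {
        seq = pc->lock;
        __sync_synchronize();     /* order the lock read vs. data reads */
        idx = pc->index;
        if (idx == 0)
            return 0;             /* not available; caller falls back to read() */
        count = pc->offset + rdpmc_fn(idx - 1);
        __sync_synchronize();
    } while (pc->lock != seq);    /* generation changed: event may have
                                     moved counters, so retry */
    return count;
}
```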
>>>
>>>> But the key problem here is the time scaling. In case you are
>>>> multiplexing, you need to be able to retrieve time_enabled and
>>>> time_running to scale the count. But that's not exposed, thus it
>>>> does not work as soon as you have multiplexing. Well, unless you
>>>> only care about deltas and not the absolute values.
>>> Doesn't perf_event_mmap_page expose both those, also protected by the
>>> generation counter? Or are you saying the kernel doesn't actually update
>>> those fields right now?
>>>
>> Yes, it does. I am not sure they're updated correctly, though.
>> I have not tried that in a very long time.
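Assuming time_enabled and time_running do come through correctly (read
under the same generation-counter protocol as the count), the scaling
arithmetic itself is simple. A sketch:

```c
#include <stdint.h>

/* Scale a raw count for multiplexing: the event only counted while it
 * actually occupied a hardware counter (time_running), so extrapolate
 * to the full time it was enabled (time_enabled). */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0;             /* event never got onto the PMU */
    /* go through double to avoid 64-bit overflow in raw * time_enabled */
    return (uint64_t)((double)raw * time_enabled / time_running);
}
```

When the event is not multiplexed, time_enabled == time_running and the
raw count comes back unchanged.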
>>
> Did you manage to make libpfm4's self_count program work correctly?
> Even by just looking at the raw count coming out of rdpmc?
>
> I think there are issues with hdr->offset, i.e., the 64-bit sw-maintained
> base for the counter.
I only did limited testing because other things took priority over the
last couple of weeks, but I'll be back on it in the next couple of
weeks. Meanwhile, here's what I know:
The machine is a Westmere EX (which is why I can't just use an older
kernel+perfctr) running kernel 2.6.38. I've got the CVS head of PAPI,
wired up with git version 9fc1bc1e of libpfm4. self_count segfaults by
default because rdpmc is privileged, and PAPI's unit tests cause the
machine to hard-lock (I have to use the hypervisor to reboot). One
definite culprit is ctests/overflow_allcounters, but I haven't done a
bisection search on 2.6.38 to see if there are any others. After
upgrading to kernel 2.6.39, ctests/overflow_allcounters is the only
unit test failure, but it "only" hard-locks the perf events
infrastructure rather than the whole machine: the test's process hangs
with 0% CPU utilization and becomes unkillable, and any later process
attempting to use perf events suffers the same fate. The mmap+rdpmc
support is apparently disabled in 2.6.39, in that index=0 for all time.
The self_count test runs without errors and reports monotonically
increasing values, but I never attempted to verify that the starting
count was meaningful.
For now I've rolled back to 2.6.38, since the later version is a step
backwards for my needs. With the kernel module I mentioned before,
user-level rdpmc seems to stay enabled indefinitely and self_count runs
without errors. I've extended the test slightly to run fib with
n={30,35,40}, to track which counter number it used directly (if any),
and to report the deltas between measurements. Here's the output I get:
> $ ./self_count
> raw=0xcd73 offset=0x0, ena=36278 run=36278 idx=-1 direct=0
> 52595 PERF_COUNT_HW_CPU_CYCLES (delta= cd73)
> raw=0xffff811d738b offset=0x7fffffff, ena=36278 run=36278 idx=0 direct=1
> 281474995417994 PERF_COUNT_HW_CPU_CYCLES (delta= 10000011ca617)
> raw=0xffff8d588633 offset=0x7fffffff, ena=36278 run=36278 idx=0 direct=1
> 281475200615986 PERF_COUNT_HW_CPU_CYCLES (delta= c3b12a8)
> raw=0xffff94aa8789 offset=0xfffffffe, ena=36278 run=36278 idx=0 direct=1
> 281477470914439 PERF_COUNT_HW_CPU_CYCLES (delta= 87520155)
> raw=0xffff95c33ede offset=0xfffffffe, ena=36278 run=36278 idx=0 direct=1
> 281477489311452 PERF_COUNT_HW_CPU_CYCLES (delta= 118b755)
> raw=0xffffa1ea0c92 offset=0xfffffffe, ena=36278 run=36278 idx=0 direct=1
> 281477693181072 PERF_COUNT_HW_CPU_CYCLES (delta= c26cdb4)
> raw=0xffffa8e995ed offset=0x17ffffffd, ena=36278 run=36278 idx=0 direct=1
> 281479958074858 PERF_COUNT_HW_CPU_CYCLES (delta= 86ff895a)
> raw=0xffffaa0262b9 offset=0x17ffffffd, ena=36278 run=36278 idx=0 direct=1
> 281479976477366 PERF_COUNT_HW_CPU_CYCLES (delta= 118cccc)
> raw=0xffffb6284921 offset=0x17ffffffd, ena=36278 run=36278 idx=0 direct=1
> 281480180287774 PERF_COUNT_HW_CPU_CYCLES (delta= c25e668)
Judging from the above, the offset does seem to be broken; truncated to
32 bits, perhaps? If I force it to always call read(), the numbers make
more sense:
> $ ./self_count
> raw=0xda66 offset=0x0, ena=39065 run=39065 idx=-1 direct=0
> 55910 PERF_COUNT_HW_CPU_CYCLES (delta= da66)
> raw=0x11dc60e offset=0x0, ena=10052007 run=10052007 idx=-1 direct=0
> 18728462 PERF_COUNT_HW_CPU_CYCLES (delta= 11ceba8)
> raw=0xd590016 offset=0x0, ena=120077612 run=120077612 idx=-1 direct=0
> 223936534 PERF_COUNT_HW_CPU_CYCLES (delta= c3b3a08)
> raw=0x9466a0de offset=0x0, ena=1334882738 run=1334882738 idx=-1 direct=0
> 2489753822 PERF_COUNT_HW_CPU_CYCLES (delta= 870da0c8)
> raw=0x957f95c4 offset=0x0, ena=1344755931 run=1344755931 idx=-1 direct=0
> 2508166596 PERF_COUNT_HW_CPU_CYCLES (delta= 118f4e6)
> raw=0xa1b53a47 offset=0x0, ena=1454582523 run=1454582523 idx=-1 direct=0
> 2713008711 PERF_COUNT_HW_CPU_CYCLES (delta= c35a483)
The counter itself seems to work fine, though, and I'd only be using it
for deltas anyway.
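One more data point on the reconstruction itself: whatever the kernel
does with hdr->offset, the user-space side has to sign-extend the raw
rdpmc value to the hardware counter width before adding the base. A
sketch, assuming a 48-bit counter width (Westmere's counters are 48
bits, but in general the width should come from the kernel rather than
be hard-coded; this also relies on GCC's arithmetic right shift of
signed values):

```c
#include <stdint.h>

/* Reconstruct a 64-bit count from a raw rdpmc value: the hardware
 * counter is narrower than 64 bits, so the raw value must be
 * sign-extended to the counter width before the sw base is added. */
static uint64_t full_count(uint64_t raw, int64_t offset, unsigned width)
{
    int shift = 64 - width;
    int64_t signed_raw = ((int64_t)(raw << shift)) >> shift;
    return offset + signed_raw;
}
```

Feeding it the first broken sample above, raw=0xffff811d738b with
offset=0x7fffffff, yields a value around 18.7M cycles, in the same
ballpark as the read() numbers, rather than the 2.8e14 the direct path
reported.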
Ryan
_______________________________________________
perfmon2-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel