[Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-06-04 Thread Robert Bragg
On Wed, May 27, 2015 at 4:39 PM, Peter Zijlstra wrote:
> On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
>> >
>> > So for me the 'natural' way to represent this in perf would be through
>> > event groups. Create a perf event for every single event -- yes this is
>> > 53 events.
>>
>> So when I was first looking at this work I had considered the
>> possibility of separate events, and these are some of the things that
>> in the end made me forward the hardware's raw report data via a single
>> event instead...
>>
>> There are 100s of possible B counters depending on the MUX
>> configuration + fixed-function logic in addition to the A counters.
>
> There are only 8 B counters; what there are 100s of is configurations, and
> that's no different from any other PMU. There are 100s of events to select
> on the CPU side as well - yet only 4 (general purpose) counters (8 if
> you disable HT on recent machines) per CPU.

To clarify one thing here: in the PRM you'll see references to
'Reserved' slots in the report layouts, and these effectively
correspond to a further 8 configurable counters. These are referred to
as 'C' counters, for 'custom', and like the B counters they are
defined by the MUX configuration, except that they skip the boolean
logic pipeline.

Thanks for the comparison with the CPU.

My main thought here is to beware of considering these configurable
counters as 'general purpose' if that might also imply orthogonality.
All our configurable counters build on top of a common/shared MUX
configuration so I'd tend to think of them more like a 'general
purpose counter set'.

>
> Of course, you could create more than 8 events at any one time, and then
> perf will round-robin the configuration for you.

This hardware really isn't designed to allow round-robin
reconfiguration. If we change the metric set then we lose the
aggregate values of our B counters. Besides writing the large MUX
config + boolean logic state, we would also need to save the current
aggregated counter values, which are only maintained by the OA unit,
so they could be restored later, as well as any internal state that's
simply not accessible to us. If we reconfigure to try and round-robin
between sets of counters, each reconfiguration will trash state that
we have no way of restoring. Since there's no going back once we
reconfigure, the counters must no longer be in use at that point.

>
>> A
>> choice would need to be made about whether to expose events for the
>> configurable counters that aren't inherently associated with any
>> semantics, or instead defining events for counters with specific
> >> semantics (with 100s of possible counters to define). The latter would
>> seem more complex for userspace and the kernel if they both now have
>> to understand the constraints on what counters can be used together.
>
> Again, no different from existing PMUs. The most 'interesting' part of
> any PMU driver is event scheduling.
>
> The explicit design of perf was to put event scheduling _in_ the kernel
> and not allow userspace direct access to the hardware (as previous PMU
> models did).

I'm thinking your 'event scheduling' here refers to the kernel being
able to handle multiple concurrent users of the same pmu, but then we
may be talking at cross purposes...

I was describing how, if we exposed events for counters with well
defined semantics (as opposed to just 16 events for the B/C counters),
the driver would need to do some kind of matching based on the events
added to a group to deduce which MUX configuration should be used. In
that case userspace would need to be very aware of which events belong
to a specific MUX configuration.

I don't think we have the hardware flexibility to support scheduling
in terms of multiple concurrent users, unless they happen to want
exactly the same configuration.

I also don't think we have a use case for accessing gpu metrics where
concurrent access is really important. It's possible to think of cases
where it might be nice, but they are unlikely to involve the same
configuration. Given the current OA unit design I haven't had plans to
try and support concurrent access, and I haven't had any requests for
it so far.

>
>> I
>> guess with either approach we would also need to have some form of
>> dedicated group leader event accepting attributes for configuring the
>> state that affects the group as a whole, such as the counter
>> configuration (3D vs GPGPU vs media etc).
>
> That depends a bit on how flexible you want to switch between these
> modes; with the cost being in the 100s of MMIO register writes, it might
> just not make sense to make this too dynamic.
>
> An option might be to select the mode through the PMU driver's sysfs
> files. Another option might be to expose the B counters 3 times and have
> each set be mutually exclusive. That is, as soon as you've created a 3D
> event, you can no longer create GPGPU/Media events.

Hmm, interesting idea to have global/pmu state be controlled via 

[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-27 Thread Ingo Molnar

* Peter Zijlstra wrote:

> > As it is currently the kernel doesn't need to know anything about the
> > semantics of individual counters being selected, so it's currently
> > convenient that we can aim to maintain all the counter meta data we
> > have in userspace according to the changing needs of tools or drivers
> > (e.g. names, descriptions, units, max values, normalization equations),
> > de-coupled from the kernel, instead of splitting it between the kernel
> > and userspace.
> 
> And that fully violates the design premise of perf. The kernel should be in 
> control of resources, not userspace.
> 
> If we had put userspace in charge, we could not now profile the same task
> by two (or more) different observers. But with kernel side counter
> management that is no problem at all.

Beyond that there are a couple of other advantages of managing performance
metrics via the kernel:

2)

Per task/workload/user separation of events, for example you can do:

perf record make bzImage
perf stat make bzImage

These will only profile/count in the child task hierarchy spawned by the shell. 
Multiple such sessions can be started and the perf sampling/counting of the 
separate workloads is independent of each other.

3)

Fine-grained security model: you can choose which events to expose to which
privilege levels. You can also work around hardware bugs without
complicating tooling.

4)

Seamless integration of the various PMU and other event metrics with each
other: you can start a CPU PMU event, a trace event, a software event and
an uncore event, which all output into the same coherent, ordered
ring-buffer, timestamped and correlated consistently, etc.

5)

The unified event model on the kernel side also brings with it 100 KLOC of
rich, unified tooling in tools/perf/, which tries to keep it all together
and functional.



The cost of all that is having to implement kernel support for it, but once
that initial hurdle is taken, it's well worth it! A PMU driver can be as
simple as offering only a single system-wide counter.

Thanks,

Ingo


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-27 Thread Peter Zijlstra
On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> >
> > So for me the 'natural' way to represent this in perf would be through
> > event groups. Create a perf event for every single event -- yes this is
> > 53 events.
> 
> So when I was first looking at this work I had considered the
> possibility of separate events, and these are some of the things that
> in the end made me forward the hardware's raw report data via a single
> event instead...
> 
> There are 100s of possible B counters depending on the MUX
> configuration + fixed-function logic in addition to the A counters.

There are only 8 B counters; what there are 100s of is configurations, and
that's no different from any other PMU. There are 100s of events to select
on the CPU side as well - yet only 4 (general purpose) counters (8 if
you disable HT on recent machines) per CPU.

Of course, you could create more than 8 events at any one time, and then
perf will round-robin the configuration for you.

> A
> choice would need to be made about whether to expose events for the
> configurable counters that aren't inherently associated with any
> semantics, or instead defining events for counters with specific
> semantics (with 100s of possible counters to define). The latter would
> seem more complex for userspace and the kernel if they both now have
> to understand the constraints on what counters can be used together.

Again, no different from existing PMUs. The most 'interesting' part of
any PMU driver is event scheduling.

The explicit design of perf was to put event scheduling _in_ the kernel
and not allow userspace direct access to the hardware (as previous PMU
models did).

> I
> guess with either approach we would also need to have some form of
> dedicated group leader event accepting attributes for configuring the
> state that affects the group as a whole, such as the counter
> configuration (3D vs GPGPU vs media etc). 

That depends a bit on how flexible you want to switch between these
modes; with the cost being in the 100s of MMIO register writes, it might
just not make sense to make this too dynamic.

An option might be to select the mode through the PMU driver's sysfs
files. Another option might be to expose the B counters 3 times and have
each set be mutually exclusive. That is, as soon as you've created a 3D
event, you can no longer create GPGPU/Media events.
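
To make that concrete, here's a minimal sketch of such mutual exclusion;
every name below is invented for illustration, none of it is from the
series:

    enum oa_metrics_set { OA_SET_NONE, OA_SET_3D, OA_SET_GPGPU, OA_SET_MEDIA };

    static enum oa_metrics_set oa_active_set = OA_SET_NONE;
    static int oa_active_events;

    /* Called on the pmu::event_init() path: the first event claims a metric
     * set; an event needing a different set fails until the last user of
     * the current set is gone, since all sets share one MUX configuration. */
    static int oa_claim_metrics_set(enum oa_metrics_set set)
    {
            if (oa_active_set != OA_SET_NONE && oa_active_set != set)
                    return -1;              /* would be -EBUSY in a real pmu */
            oa_active_set = set;
            oa_active_events++;
            return 0;
    }

    /* Called when an event is destroyed. */
    static void oa_release_metrics_set(void)
    {
            if (--oa_active_events == 0)
                    oa_active_set = OA_SET_NONE;  /* MUX config free to change */
    }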

> I'm not sure where we would
> handle the context-id + drm file descriptor attributes for initiating
> single context profiling but guess we'd need to authenticate each
> individual event open.

Event open is a slow path, so that should not be a problem, right?

> It's not clear if we'd configure the report
> layout via the group leader, or try to automatically choose the most
> compact format based on the group members. I'm not sure how pmus
> currently handle the opening of enabled events on an enabled group but

That would mean the PERF_FORMAT_GROUP layout changing in-flight; I would
recommend not doing that -- decoding the output would be 'interesting',
though not impossible.

> I think there would need to be limitations in our case that new
> members can't result in a reconfigure of the counters if that might
> lose the current counter values known to userspace.

I'm not entirely sure what you mean here.

> From a user's pov, there's no real freedom to mix and match which
> counters are configured together, and there's only some limited
> ability to ignore some of the currently selected counters by not
> including them in reports.

It is 'impossible' to create a group that is not programmable. That is,
the pmu::event_init() call _should_ verify that the addition of the
event to the group (as given by event->group_leader) is valid.
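
As a minimal sketch of that contract (types and names below are
illustrative only, not the series' code):

    struct oa_event {
            int metrics_set;                /* which MUX configuration it needs */
            struct oa_event *group_leader;  /* as in perf_event->group_leader */
    };

    /* The event_init-time check: an event whose counters need a different
     * MUX configuration than its group leader's would make the group
     * impossible to program as a unit, so refuse it up front. */
    static int oa_validate_group(struct oa_event *event)
    {
            struct oa_event *leader = event->group_leader;

            if (leader && leader != event &&
                leader->metrics_set != event->metrics_set)
                    return -1;      /* -EINVAL in a real pmu::event_init() */
            return 0;
    }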

> Something to understand here is that we have to work with sets of
> pre-defined MUX + fixed-function logic configurations that have been
> validated to give useful metrics for specific use cases, such as
> benchmarking 3D rendering, GPGPU or media workloads.

This is fine; as stated above. Since these are limited pieces of
'firmware' which are expensive to load, you don't have to do a fully
dynamic solution here.

> As it is currently the kernel doesn't need to know anything about the
> semantics of individual counters being selected, so it's currently
> convenient that we can aim to maintain all the counter meta data we
> have in userspace according to the changing needs of tools or drivers
> (e.g. names, descriptions, units, max values, normalization
> equations), de-coupled from the kernel, instead of splitting it
> between the kernel and userspace.

And that fully violates the design premise of perf. The kernel should be
in control of resources, not userspace.

If we had put userspace in charge, we could not now profile the same
task by two (or more) different observers. But with kernel side counter
management that is no problem at all.

> A benefit of being able to change the report size is to reduce memory
> bandwidth usage that can skew 

[Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-21 Thread Daniel Vetter
On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra wrote:
> > On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> >> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra wrote:
> >> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >> >
> >> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> >> bitfields into the perf_event_attr config members, I'm now
> >> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> >> config member that's extensible and validated in the same way as the
> >> >> perf_event_attr struct. I've found this much nicer to work with while
> >> >> being neatly extensible too.
> >> >
> >> > This worries me a bit.. is there more background for this?
> >>
> >> Would it maybe be helpful to see the before and after? I had kept this
> >> uapi change in a separate patch for a while locally but in the end
> >> decided to squash it before sending out my updated series.
> >>
> >> Although I did find it a bit awkward with the bitfields, I was mainly
> >> concerned about the extensibility of packing logically separate
> >> attributes into the config members and had heard similar concerns from
> >> a few others who had been experimenting with my patches too.
> >>
> >> A few simple attributes I can think of a.t.m that we might want to add
> >> in the future are:
> >> - control of the OABUFFER size
> >> - a way to ask the kernel to collect reports at the beginning and end
> >> of batch buffers, in addition to periodic reports
> >> - alternative ways to uniquely identify a context to support tools
> >> profiling a single context not necessarily owned by the current
> >> process
> >>
> >> It could also be interesting to expose some counter configuration
> >> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> >> EU' counters included in the OA unit reports, each with a 16bit
> >> configuration.
> >>
> >> In a more extreme case it might also be useful to allow userspace to
> >> specify a complete counter config, which (depending on the
> >> configuration) could be over 100 32bit values to select the counter
> >> signals + configure the corresponding combining logic.
> >>
> >> Since this pmu is in a device driver it also seemed reasonably
> >> appropriate to de-couple it slightly from the core perf_event_attr
> >> structure by allowing driver extensible attributes.
> >>
> >> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> >> were instead a re-usable utility perhaps maintained in events/core.c,
> >> so if other pmu drivers were to follow suit there would be less risk
> >> of a mistake being made here?
> >
> >
> > So I had a peek at:
> >
> >   
> > https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
> >
> > In an attempt to inform myself of how the hardware worked. But the
> > document is rather sparse (and does not include broadwell).
> 
> Thanks. Sorry I can't easily fix this myself, but there is work
> ongoing to update this documentation. In the interim I can try and
> cover some of the key details here...

Yeah the docs are decidedly less than stellar and especially lack
examples for how to use the flexible counters properly :( A few additional
comments from me below.

> So from what I can gather there are two ways to observe the counters,
> through MMIO or through the ring-buffer, which in turn seems triggered by
> > a MI_REPORT_PERF_COUNT command.
> 
> There are three ways; mmio, MI_REPORT_PERF_COUNT via the ring-buffer
> and periodic sampling where the HW writes into a circular 'oabuffer'.
> 
> I think it's best to discount mmio as a debug feature since none of
> the counters are useful in isolation and mmio subverts the latching
> mechanism we get with the other two methods. We typically at least
> want to relate a counter to the number of clock cycles elapsed or the
> gpu time, or thread spawn counts.
> 
This pmu driver is primarily for exposing periodic metrics, while
> it's up to applications to choose to emit MI_REPORT_PERF_COUNT
> commands as part of their command streams so I think we can mostly
> ignore MI_REPORT_PERF_COUNT here too.

Yeah this is a crucial bit here - userspace inserts MI_REPORT_PERF_COUNT
into the gpu batchbuffer. And the gpu stores the samples at the userspace
provided address in the per-process gpu address space. Which means that
the kernel never ever sees these samples, nor can it get (efficiently at
least) at the target memory.
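
For illustration, userspace emission of such a snapshot might look roughly
like this; the batch helpers and the opcode encoding are placeholders
loosely patterned on Mesa-style code, not a real API:

    #include <stdint.h>

    /* 3-dword MI command; the length field encodes (dword count - 2). The
     * opcode encoding here is a guess for illustration - check the PRM. */
    #define MI_REPORT_PERF_COUNT    ((0x28u << 23) | (3 - 2))

    extern void out_batch(uint32_t dw);        /* append one dword to the batch */
    extern void out_reloc(uint32_t bo_offset); /* emit a gpu virtual address */

    /* Latch a full OA report into a buffer object: bracketing a single draw
     * call with two of these bounds the counters to just that work. */
    static void emit_report_perf_count(uint32_t bo_offset, uint32_t report_id)
    {
            out_batch(MI_REPORT_PERF_COUNT);
            out_reloc(bo_offset);       /* where the hw writes the report */
            out_batch(report_id);       /* tag to identify this snapshot */
    }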

The mid-batchbuffer placement of perf samples is important since wrapping
just the entire batch is way too coarse. And timer-based sampling isn't
aligned to the different cmds, which means you can't restrict sampling
to e.g. only a specific shader. Right now the gpu doesn't provide any
means to restrict sampling except emitting a full 

[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-21 Thread Robert Bragg
On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra wrote:
> On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
>> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra wrote:
>> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
>> >
>> >> I've changed the uapi for configuring the i915_oa specific attributes
>> >> when calling perf_event_open(2) whereby instead of cramming lots of
>> >> bitfields into the perf_event_attr config members, I'm now
>> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
>> >> config member that's extensible and validated in the same way as the
>> >> perf_event_attr struct. I've found this much nicer to work with while
>> >> being neatly extensible too.
>> >
>> > This worries me a bit.. is there more background for this?
>>
>> Would it maybe be helpful to see the before and after? I had kept this
>> uapi change in a separate patch for a while locally but in the end
>> decided to squash it before sending out my updated series.
>>
>> Although I did find it a bit awkward with the bitfields, I was mainly
>> concerned about the extensibility of packing logically separate
>> attributes into the config members and had heard similar concerns from
>> a few others who had been experimenting with my patches too.
>>
>> A few simple attributes I can think of a.t.m that we might want to add
>> in the future are:
>> - control of the OABUFFER size
>> - a way to ask the kernel to collect reports at the beginning and end
>> of batch buffers, in addition to periodic reports
>> - alternative ways to uniquely identify a context to support tools
>> profiling a single context not necessarily owned by the current
>> process
>>
>> It could also be interesting to expose some counter configuration
>> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
>> EU' counters included in the OA unit reports, each with a 16bit
>> configuration.
>>
>> In a more extreme case it might also be useful to allow userspace to
>> specify a complete counter config, which (depending on the
>> configuration) could be over 100 32bit values to select the counter
>> signals + configure the corresponding combining logic.
>>
>> Since this pmu is in a device driver it also seemed reasonably
>> appropriate to de-couple it slightly from the core perf_event_attr
>> structure by allowing driver extensible attributes.
>>
>> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
>> were instead a re-usable utility perhaps maintained in events/core.c,
>> so if other pmu drivers were to follow suit there would be less risk
>> of a mistake being made here?
>
>
> So I had a peek at:
>
>   
> https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
>
> In an attempt to inform myself of how the hardware worked. But the
> document is rather sparse (and does not include broadwell).

Thanks. Sorry I can't easily fix this myself, but there is work
ongoing to update this documentation. In the interim I can try and
cover some of the key details here...

>
> So from what I can gather there are two ways to observe the counters,
> through MMIO or through the ring-buffer, which in turn seems triggered by
> a MI_REPORT_PERF_COUNT command.

There are three ways; mmio, MI_REPORT_PERF_COUNT via the ring-buffer
and periodic sampling where the HW writes into a circular 'oabuffer'.

I think it's best to discount mmio as a debug feature since none of
the counters are useful in isolation and mmio subverts the latching
mechanism we get with the other two methods. We typically at least
want to relate a counter to the number of clock cycles elapsed or the
gpu time, or thread spawn counts.

This pmu driver is primarily for exposing periodic metrics, while
it's up to applications to choose to emit MI_REPORT_PERF_COUNT
commands as part of their command streams so I think we can mostly
ignore MI_REPORT_PERF_COUNT here too.

>
> [ Now the document talks about shortcomings of this scheme, where the
> MI_REPORT_PERF_COUNT command can only be placed every other command, but
> a single command can contain so much computation that this is not fine
> grained enough -- leading to the suggestion of a timer/cycle based
> reporting, but that is not actually mentioned afaict ]

It's in there, though unfortunately the documentation isn't very clear
currently. The 'Performance Event Counting' section seems to be the
appropriate place to introduce the periodic sampling feature, but just
reading it myself it really only talks about the limitations of
reporting, as you say. I'll see if I can prod someone to get this improved.

If you see page 18 "Performance Statistics Registers":

OACONTROL has a 'Timer Period' field and 'Timer Enable'
OABUFFER points to a circular buffer for periodic reports
OASTATUS1/2 contain the head/tail pointers
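
To illustrate how those registers fit together, a rough sketch of draining
the circular buffer follows; the accessors and register parameters are
placeholders patterned on the PRM's description, not the driver's actual
code:

    #include <stdint.h>

    #define OA_REPORT_SIZE  64u     /* size of one report in the chosen format */

    extern uint32_t oa_read_reg(uint32_t reg);  /* placeholder MMIO accessor */
    extern void oa_forward_report(const uint8_t *report);

    extern uint8_t *oabuffer;       /* CPU mapping of the circular buffer */
    extern uint32_t oabuffer_size;  /* power of two */

    /* Forward every periodic report the hw has written between our last
     * head position and the hw's current tail pointer (one of OASTATUS1/2
     * holds the head, the other the tail). */
    static void oa_drain_reports(uint32_t head_reg, uint32_t tail_reg)
    {
            uint32_t head = oa_read_reg(head_reg) & (oabuffer_size - 1);
            uint32_t tail = oa_read_reg(tail_reg) & (oabuffer_size - 1);

            while (head != tail) {
                    oa_forward_report(oabuffer + head);
                    head = (head + OA_REPORT_SIZE) & (oabuffer_size - 1);
            }
            /* a real driver would write the consumed head back to the hw here */
    }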


>
> Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
> which events it will write out.
>
> This covers the regular 'A' counters. 

[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-19 Thread Peter Zijlstra
On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra wrote:
> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >
> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> bitfields into the perf_event_attr config members, I'm now
> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> config member that's extensible and validated in the same way as the
> >> perf_event_attr struct. I've found this much nicer to work with while
> >> being neatly extensible too.
> >
> > This worries me a bit.. is there more background for this?
> 
> Would it maybe be helpful to see the before and after? I had kept this
> uapi change in a separate patch for a while locally but in the end
> decided to squash it before sending out my updated series.
> 
> Although I did find it a bit awkward with the bitfields, I was mainly
> concerned about the extensibility of packing logically separate
> attributes into the config members and had heard similar concerns from
> a few others who had been experimenting with my patches too.
> 
> A few simple attributes I can think of a.t.m that we might want to add
> in the future are:
> - control of the OABUFFER size
> - a way to ask the kernel to collect reports at the beginning and end
> of batch buffers, in addition to periodic reports
> - alternative ways to uniquely identify a context to support tools
> profiling a single context not necessarily owned by the current
> process
> 
> It could also be interesting to expose some counter configuration
> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> EU' counters included in the OA unit reports, each with a 16bit
> configuration.
> 
> In a more extreme case it might also be useful to allow userspace to
> specify a complete counter config, which (depending on the
> configuration) could be over 100 32bit values to select the counter
> signals + configure the corresponding combining logic.
> 
> Since this pmu is in a device driver it also seemed reasonably
> appropriate to de-couple it slightly from the core perf_event_attr
> structure by allowing driver extensible attributes.
> 
> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> were instead a re-usable utility perhaps maintained in events/core.c,
> so if other pmu drivers were to follow suit there would be less risk
> of a mistake being made here?


So I had a peek at:

  
https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf

In an attempt to inform myself of how the hardware worked. But the
document is rather sparse (and does not include broadwell).

So from what I can gather there are two ways to observe the counters,
through MMIO or through the ring-buffer, which in turn seems triggered by
a MI_REPORT_PERF_COUNT command.

[ Now the document talks about shortcomings of this scheme, where the
MI_REPORT_PERF_COUNT command can only be placed every other command, but
a single command can contain so much computation that this is not fine
grained enough -- leading to the suggestion of a timer/cycle based
reporting, but that is not actually mentioned afaict ]

Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
which events it will write out.

This covers the regular 'A' counters. Is this correct?

Then there are the 'B' counters, which appear to be programmable through
the CEC MMIO registers.

These B events can also be part of the MI_REPORT_PERF_COUNT vector.

Right?


So for me the 'natural' way to represent this in perf would be through
event groups. Create a perf event for every single event -- yes this is
53 events.

Use the MMIO reads for the regular read() interface, and use a hrtimer
placing MI_REPORT_PERF_COUNT commands, with a counter select mask
covering all the events in the current group, for sampling.

Then obtain the vector values using PERF_SAMPLE_READ and
PERF_FORMAT_GROUP -- and use the read txn support to consume the
ring-buffer vectors instead of the MMIO registers.

You can use the perf_event_attr::config to select the counter (A0-A44,
B0-B7) and use perf_event_attr::config1 (low and high dword) for the
corresponding CEC registers.
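
For concreteness, a rough userspace sketch of that model; the counter-select
encodings are invented here, since the proposal only says config selects the
counter and config1 carries the CEC values:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <string.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    /* Open one counter as a group member; pass group_fd == -1 for the event
     * that will become the group leader. */
    static int open_oa_counter(uint32_t pmu_type, uint64_t counter_select,
                               uint64_t cec_config, int group_fd)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = pmu_type;           /* dynamic PMU type from sysfs */
            attr.config = counter_select;   /* which of A0-A44 / B0-B7 */
            attr.config1 = cec_config;      /* CEC low/high dwords (B counters) */
            attr.sample_type = PERF_SAMPLE_READ;
            attr.read_format = PERF_FORMAT_GROUP;

            return perf_event_open(&attr, -1 /* pid */, 0 /* cpu */, group_fd, 0);
    }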


This does not require random per-driver ABI extensions for
perf_event_attr, nor your custom output format.

Am I missing something obvious here?


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-18 Thread Robert Bragg
On Fri, May 8, 2015 at 5:21 PM, Peter Zijlstra wrote:
>
> So I've not yet gone through the entire series; but I'm wondering if it's
> at all possible to re-use some of this work:
>
> lkml.kernel.org/r/1428453299-19121-1-git-send-email-sukadev at 
> linux.vnet.ibm.com
>
> That's for a Power8 HV call that can basically return an array of
> values; which on a superficial level sounds a bit like what this GPU
> hardware does.

Thanks for this pointer.

I think the main similarity here is the ability to capture multiple
counters consistent for the same point in time, but in our case we
don't have an explicitly controlled transaction mechanism like this.

Although we can collect a large set of counters in a latched fashion -
so they are self consistent - the selection of counters included in
our OA unit reports is more rigid.

Most of our counters aren't independently aggregated, they are derived
from signals selected as part of the OA unit configuration and the
values are only maintained by the OA unit itself, so a
re-configuration to select different signals will discard the counter
values of currently selected signals.

Afaik re-configuring our signal selection is also relatively slow
(I was told this last week at least, though I haven't tested it myself),
so it's really geared towards applications or tools choosing a
configuration and maintaining it for a relatively long time while
profiling a workload.

I think the other big difference here is that we don't have a way to
explicitly trigger a report to be written from the cpu. (Although we
can read the OA counters via mmio, it's only intended for debug
purposes as this subverts the hw latching of counters) This means it
would be difficult to try and treat this like a transaction including
a fixed set of event->read()s without a way for pmu->commit_txn() to
trigger a report.

>
> Let me read more of this..

Thanks.

- Robert


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-15 Thread Robert Bragg
On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra wrote:
> On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
>
>> I've changed the uapi for configuring the i915_oa specific attributes
>> when calling perf_event_open(2) whereby instead of cramming lots of
>> bitfields into the perf_event_attr config members, I'm now
>> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
>> config member that's extensible and validated in the same way as the
>> perf_event_attr struct. I've found this much nicer to work with while
>> being neatly extensible too.
>
> This worries me a bit.. is there more background for this?

Would it maybe be helpful to see the before and after? I had kept this
uapi change in a separate patch for a while locally but in the end
decided to squash it before sending out my updated series.

Although I did find it a bit awkward with the bitfields, I was mainly
concerned about the extensibility of packing logically separate
attributes into the config members and had heard similar concerns from
a few others who had been experimenting with my patches too.

A few simple attributes I can think of a.t.m that we might want to add
in the future are:
- control of the OABUFFER size
- a way to ask the kernel to collect reports at the beginning and end
of batch buffers, in addition to periodic reports
- alternative ways to uniquely identify a context to support tools
profiling a single context not necessarily owned by the current
process

It could also be interesting to expose some counter configuration
through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
EU' counters included in the OA unit reports, each with a 16bit
configuration.

In a more extreme case it might also be useful to allow userspace to
specify a complete counter config, which (depending on the
configuration) could be over 100 32bit values to select the counter
signals + configure the corresponding combining logic.

Since this pmu is in a device driver it also seemed reasonably
appropriate to de-couple it slightly from the core perf_event_attr
structure by allowing driver extensible attributes.

I wonder if it might be less worrisome if the i915_oa_copy_attr() code
were instead a re-usable utility perhaps maintained in events/core.c,
so if other pmu drivers were to follow suit there would be less risk
of a mistake being made here?

Regards,
- Robert


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-08 Thread Peter Zijlstra
On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:

> I've changed the uapi for configuring the i915_oa specific attributes
> when calling perf_event_open(2) whereby instead of cramming lots of
> bitfields into the perf_event_attr config members, I'm now
> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> config member that's extensible and validated in the same way as the
> perf_event_attr struct. I've found this much nicer to work with while
> being neatly extensible too.

This worries me a bit.. is there more background for this?


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-08 Thread Peter Zijlstra

So I've not yet gone through the entire series; but I'm wondering if it's
at all possible to re-use some of this work:

lkml.kernel.org/r/1428453299-19121-1-git-send-email-sukadev at 
linux.vnet.ibm.com

That's for a Power8 HV call that can basically return an array of
values; which on a superficial level sounds a bit like what this GPU
hardware does.

Let me read more of this..


[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU

2015-05-07 Thread Robert Bragg
This is an updated series adding support for an "i915_oa" perf PMU for
configuring the Intel Gen graphics Observability unit (Haswell only to
start with) and forwarding periodic counter reports as perf samples.

Compared to the series I sent out last year:

The driver is now hooked into context switching so we no longer require
a context to be pinned for the full lifetime of a perf event fd (when
profiling a specific context).

Not visible in the series, but I can say we've also gained some
experience from looking at enabling Broadwell within the same
architecture. There are some fiddly challenges ahead with enabling
Broadwell but I do feel comfortable that it can be supported in the same
kind of way via perf. I haven't updated my Broadwell branches for a
little while now but if anyone is interested I can share this code as a
point of reference if that's helpful.

I've had interest from folks looking at developing tools based on this
interface that don't require root, but we are following the precedent
of not exposing system wide metrics to unprivileged processes, so I've
added a sysctl directly comparable to kernel.perf_event_paranoid
(dev.i915.oa_event_paranoid) that lets users optionally allow
unprivileged access to system wide gpu metrics.

This series is able to expose more than just the A (aggregating)
counters and demonstrates selecting more counters that are useful when
benchmarking 3D render workloads. The expectation is to add further
configurations later, geared towards media or gpgpu workloads for
example.

I've changed the uapi for configuring the i915_oa specific attributes
when calling perf_event_open(2) whereby instead of cramming lots of
bitfields into the perf_event_attr config members, I'm now
daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
config member that's extensible and validated in the same way as the
perf_event_attr struct. I've found this much nicer to work with while
being neatly extensible too.

I've made a few more (small) changes to core perf infrastructure:

I've added a PERF_EVENT_IOC_FLUSH ioctl that can be used to explicitly
ask the driver to flush buffered samples. In our case this makes sure to
forward all reports currently in the gpu mapped, circular, oabuffer as
perf samples. This issue was discussed a bit on LKML last year and
previously I was overloading our PMU's read() hook but decided that the
cleaner approach would be to add a dedicated ioctl instead.
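
Usage from a tool would then be a single call before reading the mmap'd
perf ring, along these lines; note PERF_EVENT_IOC_FLUSH only exists with
this series applied, so the constant comes from the patched uapi headers:

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>  /* patched headers define PERF_EVENT_IOC_FLUSH */

    static void flush_oa_samples(int perf_fd)
    {
            /* ask i915_oa to forward everything still buffered in the
             * gpu-mapped oabuffer as perf samples before parsing the ring */
            ioctl(perf_fd, PERF_EVENT_IOC_FLUSH, 0);
    }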

To allow device-driver PMUs to define their own types for records
written to the perf circular buffer I've introduced a PERF_RECORD_DEVICE
type whereby drivers can then document their own header defining a driver
specific scheme for sub-types. This is then used in the i915_oa driver
for reporting hw status conditions such as OABUFFER overruns or report
lost conditions from the hw.
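
For illustration, a driver-documented header for these records might look
like the following; neither the struct nor the sub-type names are taken
from the actual patches:

    #include <linux/types.h>

    /* Hypothetical i915_oa sub-types for PERF_RECORD_DEVICE records. */
    enum {
            I915_OA_RECORD_BUFFER_OVERFLOW = 1,
            I915_OA_RECORD_REPORT_LOST     = 2,
    };

    /* A driver-documented header that would lead each PERF_RECORD_DEVICE
     * record written by this driver. */
    struct i915_oa_record_header {
            __u32 type;     /* one of the I915_OA_RECORD_* values above */
            __u32 pad;
    };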


For examples of using the i915_oa driver I have a branch of Mesa that
enables support for the INTEL_performance_query extension based on this:

https://github.com/rib/drm   wip/rib/oa-hsw-4.0.0
https://github.com/rib/mesa  wip/rib/oa-hsw-4.0.0

For reference I sent out a corresponding series for the Mesa work for
review yesterday:

http://lists.freedesktop.org/archives/mesa-dev/2015-May/083519.html

I also have a minimal gputop tool that can either test Mesa's
INTEL_performance_query implementation to view per-context metrics, or
view system wide gpu metrics collected directly from perf
(gputop/gputop-perf.c would be the main code of interest):

https://github.com/rib/gputop

If it's convenient for testing, my kernel patches can also be fetched
from here:

https://github.com/rib/linux  wip/rib/oa-hsw-4.0.0

One specific patch comment:

[RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA

I didn't want to hold up getting feedback due to this issue that I'm
currently investigating (the effect on the driver should be trivial),
but I've included a work-in-progress patch since it does address a
known problem with collecting reliable periodic metrics.

Besides the last patch, I feel like this series is in pretty good shape
now. Having tested it with Mesa and several profiling tools, and having
started the work to enable Broadwell, I feel quite happy with our
approach of leveraging perf infrastructure.

I'd really appreciate any feedback on the core perf changes I've made,
as well as general feedback on the PMU driver itself.

Since it's been quite a long time since I last sent out patches for this
work; in case it's helpful to refer back to some of the discussion last
year:

https://lkml.org/lkml/2014/10/22/462

For anyone interested to know more details about this hardware, this PRM
is a good starting point:

https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf

Kind regards,
- Robert

Robert Bragg (11):
  perf: export perf_event_overflow
  perf: Add PERF_PMU_CAP_IS_DEVICE flag
  perf: Add PERF_EVENT_IOC_FLUSH ioctl
  perf: Add a PERF_RECORD_DEVICE