Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-10-20 Thread Robert Bragg
On Fri, Oct 16, 2015 at 10:43 AM, Peter Zijlstra  wrote:
> On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
>> - We're bridging two complex architectures
>>
>> To review this work I think it will be relevant to have a good
>> general familiarity with Gen graphics (e.g. thinking about the OA
>> unit's interaction with the command streamer and execlist
>> scheduling) as well as our userspace architecture and how we're
>> consuming OA data within Mesa to implement the
>> INTEL_performance_query extension.
>>
>> On the flip side here, it's necessary to understand the perf
>> userspace interface (for most this is hidden by tools so the details
>> aren't common knowledge) as well as the internal design, considering
>> that the PMU we're looking at seems to break several current design
>> assumptions. I can only claim a limited familiarity with perf's
>> design, just as a result of this work.
>
> Right; but a little effort and patience on both sides should get us
> there I think. At worst we'll both learn something new ;-)

I suppose I'm also concerned that time is an important factor. When it
comes to the OA metrics, we already have userspace tools that could be
more widely used by developers once we have an upstream interface.
Today perf isn't very well suited to our OA unit use case, and
although we may be able to change that - and I can try to help with
that - at this point I think I'd prefer not to block moving forward in
the meantime with the alternative i915 interface.

Although code-wise it didn't require any big changes to events/core to
get an initial perf based driver working for our use case, we have
raised a number of quite significant design questions and arguably cut
some corners, which could take a long time to resolve properly. I also
tend to think it's an open question at this stage whether it would
really be in everyone's interest to take perf in this direction
without a clear sense of the benefits it brings in comparison to the
complexity it may add.

It's also a bit awkward that I had already started to move ahead with
this idea of upstreaming a non-perf based driver for the OA unit after
asking Daniel Vetter about this on IRC. There are some knock-on
effects here too; Sourab Gupta is looking at building on this OA
driver and has now started adapting his work for this non-perf
approach.

>
>> - The current OA PMU driver breaks some significant design assumptions.
>>
>> Existing perf pmus are used for profiling work on a cpu and we're
>> introducing the idea of _IS_DEVICE pmus with different security
>> implications, the need to fake cpu-related data (such as user/kernel
>> registers) to fit with perf's current design, and adding _DEVICE
>> records as a way to forward device-specific status records.
>
> There are more devices with counters on than GPUs, so I think it might
> make sense to look at extending perf to better deal with this.

I wonder if it could be good to look at exposing some of the mmio
accessible Gen graphics counters before tackling a more complex case
like the OA unit. We have a number of counters that could be
interesting to sample periodically via a hrtimer, that require no
configuration, are global (so no need to specify a gpu context) but as
they relate to the GPU an _IS_DEVICE pmu would still be appropriate.
Some of these seem like they could be better suited to being exposed
via perf than OA unit counters so they might be a helpful stepping
stone.

>
>> The OA unit writes reports of counters into a circular buffer,
>> without involvement from the CPU, making our PMU driver the first of
>> a kind.
>
> Agreed, this is somewhat 'odd' from where we are today.
>
>> Perf supports groups of counters and allows those to be read via
>> transactions internally but transactions currently seem designed to
>> be explicitly initiated from the cpu (say in response to a userspace
>> read()) and while we could pull a report out of the OA buffer we
>> can't trigger a report from the cpu on demand.
>>
>> Related to being report based; the OA counters are configured in HW
>> as a set while perf generally expects counter configurations to be
>> orthogonal. Although counters can be associated with a group leader
>> as they are opened, there's no clear precedent for being able to
>> provide group-wide configuration attributes and no obvious solution
>> as yet that's expected to be acceptable to upstream and meets our
>> userspace needs.
>
> I'm not entirely sure what you mean with group-wide configuration
> attributes; could you elaborate?

Here I'm thinking of configuration details that conceptually relate to
a set of OA unit counters, not individual events/counters:

- The choice of 'metric set' which represents a MUX configuration +
boolean logic configuration for a set of counters that will be
included in the reports written by the 

Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-10-16 Thread Peter Zijlstra
On Fri, Oct 16, 2015 at 12:02:28PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
> > > - We may be making some technical compromises a.t.m for the sake of
> > >   using perf.
> > > 
> > > perf_event_open() requires events to either relate to a pid or a
> > > specific cpu core, while our device pmu relates to neither.  Events
> > > opened with a pid will be automatically enabled/disabled according
> > > to the scheduling of that process - so not appropriate for us.
> > 
> > Right; the traditional cpu/pid mapping doesn't work well for devices;
> > but maybe, with some work, we can create something like that
> > global/local render context from it; although I've no clue what form
> > that would need at this time.
> 
> Could someone please help with some very basic questions, such as what the
> hardware model of the 'OA' unit is? How are OA registers set up, how are
> their values made accessible to the host side, etc.

Robert linked to:

  
https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf

in a previous posting. It has some info, but full documentation is, as
per the initial post, 'pending'.
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-10-16 Thread Peter Zijlstra
On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
> 
> To review this work I think it will be relevant to have a good
> general familiarity with Gen graphics (e.g. thinking about the OA
> unit's interaction with the command streamer and execlist
> scheduling) as well as our userspace architecture and how we're
> consuming OA data within Mesa to implement the
> INTEL_performance_query extension.
>
> On the flip side here, it's necessary to understand the perf
> userspace interface (for most this is hidden by tools so the details
> aren't common knowledge) as well as the internal design, considering
> that the PMU we're looking at seems to break several current design
> assumptions. I can only claim a limited familiarity with perf's
> design, just as a result of this work.

Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)

> - The current OA PMU driver breaks some significant design assumptions.
> 
> Existing perf pmus are used for profiling work on a cpu and we're
> introducing the idea of _IS_DEVICE pmus with different security
> implications, the need to fake cpu-related data (such as user/kernel
> registers) to fit with perf's current design, and adding _DEVICE
> records as a way to forward device-specific status records.

There are more devices with counters on than GPUs, so I think it might
make sense to look at extending perf to better deal with this.

> The OA unit writes reports of counters into a circular buffer,
> without involvement from the CPU, making our PMU driver the first of
> a kind.

Agreed, this is somewhat 'odd' from where we are today.

> Perf supports groups of counters and allows those to be read via
> transactions internally but transactions currently seem designed to
> be explicitly initiated from the cpu (say in response to a userspace
> read()) and while we could pull a report out of the OA buffer we
> can't trigger a report from the cpu on demand.
>
> Related to being report based; the OA counters are configured in HW
> as a set while perf generally expects counter configurations to be
> orthogonal. Although counters can be associated with a group leader
> as they are opened, there's no clear precedent for being able to
> provide group-wide configuration attributes and no obvious solution
> as yet that's expected to be acceptable to upstream and meets our
> userspace needs.

I'm not entirely sure what you mean with group-wide configuration
attributes; could you elaborate?

> We currently avoid using perf's grouping feature
> and forward OA reports to userspace via perf's 'raw' sample field.
> This suits our userspace well considering how coupled the counters
> are when dealing with normalizing. It would be inconvenient to split
> counters up into separate events, only to require userspace to
> recombine them. 

So IF you were using a group, a single read from the leader can return
you a vector of all values (PERF_FORMAT_GROUP); this avoids having to
recombine them.

Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
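
For reference, the vector returned by such a group read could be decoded
roughly as below. This is only a sketch, assuming the leader was opened
with read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID and no time fields,
so the buffer is a u64 count followed by (value, id) pairs. The helper
name and the sample data are made up for illustration.

```python
import struct

def parse_group_read(buf):
    """Decode a read() buffer from a perf group leader.

    Assumes read_format == PERF_FORMAT_GROUP | PERF_FORMAT_ID, i.e.
    u64 nr, then nr pairs of (u64 value, u64 id), native byte order.
    """
    (nr,) = struct.unpack_from("=Q", buf, 0)
    expected = 8 + nr * 16
    if len(buf) < expected:
        raise ValueError("short read: need %d bytes, got %d" % (expected, len(buf)))
    pairs = struct.unpack_from("=%dQ" % (2 * nr), buf, 8)
    # Map each event id to its counter value.
    return {pairs[i + 1]: pairs[i] for i in range(0, 2 * nr, 2)}

# Made-up example data: two counters with ids 7 and 9.
sample = struct.pack("=5Q", 2, 1000, 7, 250, 9)
print(parse_group_read(sample))  # {7: 1000, 9: 250}
```

Keying the result by event id means userspace does not need to rely on
the order in which the group members were opened.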

> Related to counter orthogonality; we can't time share the OA unit,
> while event scheduling is a central design idea within perf for
> allowing userspace to open + enable more events than can be
> configured in HW at any one time.

So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.

That said; earlier today I saw:

  
https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7

where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).

So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.

> The OA unit is not designed to
> allow re-configuration while in use. We can't reconfigure the OA
> unit without losing internal OA unit state which we can't access
> explicitly to save and restore. Reconfiguring the OA unit is also
> relatively slow, involving ~100 register writes. From userspace Mesa
> also depends on a stable OA configuration when emitting
> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
> disabled while there are outstanding MI_RPC commands lest we hang
> the command streamer.

Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.

> - We may be making some technical 

Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-10-16 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> > - We may be making some technical compromises a.t.m for the sake of
> >   using perf.
> > 
> > perf_event_open() requires events to either relate to a pid or a
> > specific cpu core, while our device pmu relates to neither.  Events
> > opened with a pid will be automatically enabled/disabled according
> > to the scheduling of that process - so not appropriate for us.
> 
> Right; the traditional cpu/pid mapping doesn't work well for devices;
> but maybe, with some work, we can create something like that
> global/local render context from it; although I've no clue what form
> that would need at this time.
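
To make the pid/cpu constraint concrete, the combinations that
perf_event_open() accepts can be summarised as in the sketch below. The
function name is hypothetical; the cases follow the documented semantics
of the pid and cpu arguments, and none of them expresses "a device,
independent of any task or cpu".

```python
def describe_target(pid, cpu):
    """Summarise what a perf_event_open() pid/cpu pair measures."""
    if pid == -1 and cpu == -1:
        return "invalid"  # the syscall rejects this combination
    if pid == -1:
        return "all tasks on cpu %d" % cpu
    who = "calling task" if pid == 0 else "task %d" % pid
    where = "any cpu" if cpu == -1 else "cpu %d" % cpu
    return "%s on %s" % (who, where)

for pid, cpu in [(0, -1), (1234, 0), (-1, 2), (-1, -1)]:
    print(pid, cpu, "->", describe_target(pid, cpu))
```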

Could someone please help with some very basic questions, such as what the 
hardware model of the 'OA' unit is? How are OA registers set up, how are 
their values made accessible to the host side, etc.

I see some references to 'OA' registers in:

  
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol03-gpu_overview_1.pdf

and I tried to find a more high level description in:

   
https://01.org/linuxgraphics/documentation/hardware-specification-prms/2014-2015-intel-processors-based-broadwell-platform

but couldn't find it. (Maybe it's my fault!)

Thanks,

Ingo


Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-10-16 Thread Robert Bragg
On Fri, Oct 16, 2015 at 11:33 AM, Peter Zijlstra  wrote:
> On Fri, Oct 16, 2015 at 12:02:28PM +0200, Ingo Molnar wrote:
>>
>> * Peter Zijlstra  wrote:
>>
>> > > - We may be making some technical compromises a.t.m for the sake of
>> > >   using perf.
>> > >
>> > > perf_event_open() requires events to either relate to a pid or a
>> > > specific cpu core, while our device pmu relates to neither.  Events
>> > > opened with a pid will be automatically enabled/disabled according
>> > > to the scheduling of that process - so not appropriate for us.
>> >
>> > Right; the traditional cpu/pid mapping doesn't work well for devices;
>> > but maybe, with some work, we can create something like that
>> > global/local render context from it; although I've no clue what form
>> > that would need at this time.
>>
>> Could someone please help with some very basic questions, such as what the
>> hardware model of the 'OA' unit is? How are OA registers set up, how are
>> their values made accessible to the host side, etc.
>
> Robert linked to:
>
>   
> https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
>
> in a previous posting. It has some info, but full documentation is, as
> per the initial post, 'pending'.

There is now also some Broadwell documentation here:

https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol14-observability.pdf

Unfortunately though a mistake was made by the documentation team when
generating the PDF which unintentionally stripped out a lot of
information so it's not very helpful a.t.m. I've let them know about
some of the issues, but I'm not sure a.t.m when it may be updated.

I tried to fill in the gaps in some of our earlier conversations, so
it may be worth going over those for more details too.

Otherwise the best reference is probably my code currently, either the
RFC patches I sent most recently which at least cover up to Haswell,
or the wip/rib/oa-next branch here: https://github.com/rib/linux. The
latest perf based driver is currently in the archive/rib/oa-core-perf
branch for reference too.

- Robert


Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-09-30 Thread Robert Bragg
On Wed, Sep 30, 2015 at 9:30 AM, Chris Wilson wrote:

> On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> > Updating Mesa and GPU Top to experiment with this was straightforward
> > given the similarity to the perf interface.  The main difference is that
> > it only supports forwarding metrics via read()s instead of an mmaped
> > circular buffer. As mentioned above, I think that suits this well, and
> > requires no additional copying of data. I think the userspace code has
> > ended up being a little simpler too.
>
> Did you try updating the existing perf based overlay?
>

I don't recall the overlay attempting to read OA counters, but potentially
it could be quite nice to add support - sorry I hadn't considered that so
far.

I don't believe being perf based or not will affect the effort to do this
though. The perf based driver doesn't handle OA counter normalization in
the kernel so userspace needs to be able to handle that - which is probably
the bigger effort.
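
As a rough illustration of the kind of normalization userspace has to
do: a metric is typically derived from the difference between two raw
reports, and often needs more than one counter from the same report.
The sketch below assumes 32-bit raw counters that may wrap once between
snapshots, and the "busy percent" metric is hypothetical rather than one
of the real metric set equations.

```python
def counter_delta(start, end, bits=32):
    """Difference between two raw counter snapshots, allowing one wrap."""
    return (end - start) % (1 << bits)

def busy_percent(busy_start, busy_end, clk_start, clk_end):
    """Hypothetical metric: busy cycles as a percentage of elapsed clocks."""
    clocks = counter_delta(clk_start, clk_end)
    if clocks == 0:
        return 0.0
    return 100.0 * counter_delta(busy_start, busy_end) / clocks

# A counter that wrapped between the two snapshots still gives a sane delta.
print(counter_delta(0xFFFFFFF0, 0x10))  # 32
print(busy_percent(0, 500, 0, 1000))    # 50.0
```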

Something to note here about your early pmu driver is that it was for
counters that were explicitly sampled from the cpu, via mmio, using a
hrtimer. I think they were a better fit for the existing perf design than
the OA unit, primarily because they were explicitly read from the cpu and
each counter was independent.


>
> > Overall the driver currently isn't much more code than with perf (~200
> > lines).
> >
> > Personally my gut feeling a.t.m, is that we should aim to move forward
> > independent from perf.
> >
> > I'd really appreciate some feedback from others on this though.
> >
> > Daniel and Chris; although I think it made sense at the outset to try
> > and use perf, in light of the above would you be open to a non-perf
> > based driver for the OA unit?
>
> No. I strongly dislike that there will be multiple incompatible perf
> interfaces and strongly like the coupling with other profiling that
> comes with perf - i.e. we very much want to simultaneously sample CPU
> and GPU workloads along with other devices, that information is much
> more useful to me for the purposes of scheduling work and maximising
> concurrency than optimising shaders.
>

In this case I don't think there's inherently any more compatibility that
comes from using perf or not - no existing userspace will Just Work™ with
the perf based OA driver.

I think some of the cases you're referring to may be ok to expose via the
existing perf infrastructure, but I'm currently enabling the OA unit which
poses some unique difficulties I've tried to explain.

A guiding differentiator for whether the existing perf pmu infrastructure
is a good fit may be whether the counter is orthogonal (in terms of
configuration and normalization) and explicitly readable from the cpu.

'i915 perf' shows my lack of imagination in naming this, and maybe another
name could imply a more limited scope. I.e. on a case by case basis, when
looking to expose new counters we can still evaluate whether it makes
sense to expose them via the existing perf infrastructure or this.

- Robert


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre


[Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-09-29 Thread Robert Bragg
After some recent progress enabling the Observation Architecture unit
for Gen8+, we can hopefully paint a fairly complete picture of the
requirements for supporting the unit from Haswell to Skylake and so
I'm looking again at the challenges in upstreaming this work.

Considering this, it looked like it could be worthwhile experimenting
with a non-perf based driver for the OA unit and I'm hoping to explain
why and how it went as well as request some feedback on whether we
should aim to move forward without perf.

Besides the patches forwarded here, a branch can be found for reference
here:

  https://github.com/rib/linux - wip/rib/oa-without-perf branch

I created corresponding branches for Mesa and GPU Top to test this here
(same branch names):

  https://github.com/rib/mesa
  https://github.com/rib/gputop

Here I've only included the patches up to an initial Haswell driver,
although the wip/rib/oa-without-perf branch on github includes support
for Gen8+. Please let me know if it would be helpful to forward more.

At this point I have two drivers at feature parity; one based on perf,
one not. Technically they're very similar and the patches are split to
hopefully be quite comparable. My latest perf-based work is under
wip/rib/oa-next branches in the above repos.


So, these are the concerns I have a.t.m about upstreaming this work:
   

- We're bridging two complex architectures

To review this work I think it will be relevant to have a good
general familiarity with Gen graphics (e.g. thinking about the OA
unit's interaction with the command streamer and execlist
scheduling) as well as our userspace architecture and how we're
consuming OA data within Mesa to implement the
INTEL_performance_query extension.

On the flip side here, it's necessary to understand the perf
userspace interface (for most this is hidden by tools so the details
aren't common knowledge) as well as the internal design, considering
that the PMU we're looking at seems to break several current design
assumptions. I can only claim a limited familiarity with perf's
design, just as a result of this work.


- Limited documentation for the OA unit:

Not unique to the OA unit but I think having a driver that extends
outside of the graphics stack, into the core perf infrastructure
probably requires more comprehensive HW + graphics stack
documentation for non drm/i915 developers.  Earlier RFC discussions
were hampered somewhat by limited documentation.  Improved
documentation is always desirable, but of course it can also take a
significant amount of time and effort while some key aspects
(notably the PRMs) aren't directly under my control.
 

- The current OA PMU driver breaks some significant design assumptions.

Existing perf pmus are used for profiling work on a cpu and we're
introducing the idea of _IS_DEVICE pmus with different security
implications, the need to fake cpu-related data (such as user/kernel
registers) to fit with perf's current design, and adding _DEVICE
records as a way to forward device-specific status records.

The OA unit writes reports of counters into a circular buffer,
without involvement from the CPU, making our PMU driver the first of
a kind.

Given the way we periodically forward data from the OA buffer to
perf's buffer, these bursts of sample writes look to perf like we're
sampling too fast and so it throttles us.
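
To illustrate the forwarding step described above, here is a minimal
sketch of draining whole reports from a circular buffer between a
reader offset (head) and a hardware write offset (tail). The 256 byte
report size, the function name and the offsets are assumptions made for
the example, not a description of the actual driver code.

```python
REPORT_SIZE = 256  # assumed report size in bytes, for illustration only

def drain_reports(buf, head, tail, report_size=REPORT_SIZE):
    """Return (reports, new_head) for the reports between head and tail.

    buf is the circular buffer, head is where the reader left off and
    tail is where the hardware will write next. Both are byte offsets
    that are multiples of report_size and smaller than len(buf), and
    len(buf) is itself a multiple of report_size.
    """
    size = len(buf)
    reports = []
    while head != tail:
        reports.append(bytes(buf[head:head + report_size]))
        head = (head + report_size) % size  # wrap at the end of the buffer
    return reports, head

# Four-report buffer; the reader is at the last slot and the hardware has
# wrapped around to slot 1, so slots 3 and 0 are pending.
buf = bytearray(4 * REPORT_SIZE)
for i in range(4):
    buf[i * REPORT_SIZE:(i + 1) * REPORT_SIZE] = bytes([i]) * REPORT_SIZE
reports, head = drain_reports(buf, 3 * REPORT_SIZE, 1 * REPORT_SIZE)
print([r[0] for r in reports], head)  # [3, 0] 256
```

Because every pending report is forwarded in one pass, a long gap
between drains turns into a burst of samples, which is the pattern that
trips perf's throttling.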

Perf supports groups of counters and allows those to be read via
transactions internally but transactions currently seem designed to
be explicitly initiated from the cpu (say in response to a userspace
read()) and while we could pull a report out of the OA buffer we
can't trigger a report from the cpu on demand.

Related to being report based; the OA counters are configured in HW
as a set while perf generally expects counter configurations to be
orthogonal. Although counters can be associated with a group leader
as they are opened, there's no clear precedent for being able to
provide group-wide configuration attributes and no obvious solution
as yet that's expected to be acceptable to upstream and meets our
userspace needs. We currently avoid using perf's grouping feature
and forward OA reports to userspace via perf's 'raw' sample field.
This suits our userspace well considering how coupled the counters
are when dealing with normalizing. It would be inconvenient to split
counters up into separate events, only to require userspace to
recombine them. For Mesa it's also convenient to be forwarded raw,
periodic reports for combining with the raw reports it captures
using MI_REPORT_PERF_COUNT commands.

Related to counter orthogonality; we can't time share the OA unit,
while event scheduling is a central design idea within perf for
allowing userspace to open + enable more events than can 

Re: [Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

2015-09-29 Thread Zhenyu Wang
On 2015.09.29 15:39:03 +0100, Robert Bragg wrote:
> 
> - Logistically it might be more practical to contain this to the
>   graphics stack.
> 
> It seems fair to consider that if we can't see a very compelling
> benefit to building on perf, then containing this work to
> drivers/gpu/drm/i915 may simplify the review process as well as
> future maintenance and development.
> 

I think that although we all initially liked the idea of going with perf,
it later appeared that we might need to tie this more closely to the i915
driver. Also, thinking about enabling global profiling for all graphics
clients, extending or enabling it within an i915-specific interface seems
more feasible than trying to create another PMU driver, like the previous
implementation attempt, to suit the needs of different gfx perf data
definitions.

Robert, thanks for sending this and elaborating on it.

-- 
Open Source Technology Center, Intel ltd.

$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827

