[Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture

2015-05-18 Thread Robert Bragg
On 7 May 2015 15:58, "Chris Wilson"  wrote:
>
> On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> > + /* We bypass the default perf core perf_paranoid_cpu() ||
> > +  * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
> > +  * flag and instead authenticate based on whether the current
> > +  * pid owns the specified context, or require CAP_SYS_ADMIN
> > +  * when collecting cross-context metrics.
> > +  */
> > + dev_priv->oa_pmu.specific_ctx = NULL;
> > + if (oa_attr.single_context) {
> > + u32 ctx_id = oa_attr.ctx_id;
> > + unsigned int drm_fd = oa_attr.drm_fd;
> > + struct fd fd = fdget(drm_fd);
> > +
> > + if (fd.file) {
>
> Specify a ctx and not providing the right fd should be its own error,
> either EBADF or EINVAL.

Right, I went for both in the end; EBADF if fdget fails and EINVAL if
the fd is ok but we fail to look up a context with it.

>
> > + dev_priv->oa_pmu.specific_ctx =
> > + lookup_context(dev_priv, fd.file, ctx_id);
> > + }
>
> Missing fdput

Ah yes; fixed.
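
To sketch where that code has ended up (not the final patch, and assuming
lookup_context() takes its own reference so the fd can be dropped straight
after the lookup):

    dev_priv->oa_pmu.specific_ctx = NULL;
    if (oa_attr.single_context) {
            struct fd fd = fdget(oa_attr.drm_fd);

            /* Claiming a specific context without a usable fd is
             * now its own error... */
            if (!fd.file)
                    return -EBADF;

            dev_priv->oa_pmu.specific_ctx =
                    lookup_context(dev_priv, fd.file, oa_attr.ctx_id);
            fdput(fd);

            /* ...as is a valid fd that doesn't resolve to a context. */
            if (!dev_priv->oa_pmu.specific_ctx)
                    return -EINVAL;
    }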

>
> > + }
> > +
> > + if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
> > + return -EACCES;
> > +
> > + mutex_lock(&dev_priv->dev->struct_mutex);
>
> i915_mutex_interruptible, probably best to couple into the GPU error
> handling here as well especially as init_oa_buffer() will go onto touch
> GPU internals.

Ok, using i915_mutex_interruptible makes sense; I've also moved the
locking into init_oa_buffer.

About the GPU error handling, do you have any thoughts on what could
be most helpful here? I'm thinking a.t.m of extending
i915_capture_reg_state() in i915_gpu_error.c to capture the OACONTROL
+ OASTATUS state and perhaps all the UCGCTL unit clock gating state
too.
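
Something like the following, say (just a sketch; oacontrol/oastatus1/
oastatus2 would be new fields in the error state struct):

    /* In i915_capture_reg_state(): snapshot the OA unit state so it
     * shows up in post-mortem error dumps.
     */
    error->oacontrol = I915_READ(GEN7_OACONTROL);
    error->oastatus1 = I915_READ(GEN7_OASTATUS1);
    error->oastatus2 = I915_READ(GEN7_OASTATUS2);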

>
> > + ret = init_oa_buffer(event);
> > + mutex_unlock(&dev_priv->dev->struct_mutex);
> > +
> > + if (ret)
> > + return ret;
> > +
> > + BUG_ON(dev_priv->oa_pmu.exclusive_event);
> > + dev_priv->oa_pmu.exclusive_event = event;
> > +
> > + event->destroy = i915_oa_event_destroy;
> > +
> > + /* PRM - observability performance counters:
> > +  *
> > +  *   OACONTROL, performance counter enable, note:
> > +  *
> > +  *   "When this bit is set, in order to have coherent counts,
> > +  *   RC6 power state and trunk clock gating must be disabled.
> > +  *   This can be achieved by programming MMIO registers as
> > +  *   0xA094=0 and 0xA090[31]=1"
> > +  *
> > +  *   In our case we expect that taking pm + FORCEWAKE
> > +  *   references will effectively disable RC6 and trunk clock
> > +  *   gating.
> > +  */
> > + intel_runtime_pm_get(dev_priv);
> > + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> That is a nuisance. Aside: Why isn't OA inside the powerctx? Is a subset
> valid with forcewake? It does perturb the system greatly to disable rc6,
> so I wonder if it could be made optional?

Yes, it's a shame.

I probably only really know enough about the OA unit design to be
dangerous and won't try and comment in detail here, but I think
there's more to it than not saving state in a power context. As I
understand it, there were a number of design changes made to enable
OA+RC6 support for BDW+, including having the OA unit automatically
write out reports to the OA buffer when entering RC6.

I think just FORCEWAKE_RENDER would work here, but I only say that
because, from what I could tell, HSW only has the render forcewake
domain.

I think I need to update the comment above these lines as I don't
think these will affect crclk gating; these just handle disabling RC6.

The WIP patch I sent out basically represents me trying to get to the
bottom of the clock gating constraints we have.

At this point I think I need to disable CS unit gating via UCGCTL1, as
well as DOP gating for the render trunk clock via MISCCPCTL but I'm
not entirely confident about that just yet. At least empirically I see
these fixing some issues in rudimentary testing.
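
For reference, what I'm currently experimenting with amounts to the
following (register bits per my reading of the PRM, with the caveat above
that I'm not yet confident these are the right/only bits):

    /* Disable DOP gating for the render trunk clock... */
    I915_WRITE(GEN7_MISCCPCTL, I915_READ(GEN7_MISCCPCTL) &
                               ~GEN7_DOP_CLOCK_GATE_ENABLE);

    /* ...and stop the CS unit clock from being gated while in use. */
    I915_WRITE(GEN6_UCGCTL1, I915_READ(GEN6_UCGCTL1) |
                             GEN6_CSUNIT_CLOCK_GATE_DISABLE);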

>
> > +
> > + return 0;
> > +}
> > +
> > +static void update_oacontrol(struct drm_i915_private *dev_priv)
> > +{
> > + BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
> > +
> > + if (dev_priv->oa_pmu.event_active) {
> > + unsigned long ctx_id = 0;
> > + bool pinning_ok = false;
> > +
> > + if (dev_priv->oa_pmu.specific_ctx) {
> > + struct intel_context *ctx =
> > + dev_priv->oa_pmu.specific_ctx;
> > + struct drm_i915_gem_object *obj =
> > + ctx->legacy_hw_ctx.rcs_state;
>
> If only there was ctx->legacy_hw_ctx.rcs_vma...

Ok, I'm not sure if this is a prod to add that, a heads-up that it's
coming, or seething because some prior attempt to add it was nack'd.


>
> > +
> > + if (i915_gem_obj_is_pinned(obj)) {

[Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture

2015-05-18 Thread Robert Bragg
On 7 May 2015 15:37, "Chris Wilson"  wrote:
>
> On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> > +static int init_oa_buffer(struct perf_event *event)
> > +{
> > + struct drm_i915_private *dev_priv =
> > + container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
> > + struct drm_i915_gem_object *bo;
> > + int ret;
> > +
> > + BUG_ON(!IS_HASWELL(dev_priv->dev));
> > + BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> > + BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
> > +
> > + spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
> > +
> > + /* NB: We over allocate the OA buffer due to the way raw sample data
> > +  * gets copied from the gpu mapped circular buffer into the perf
> > +  * circular buffer so that only one copy is required.
> > +  *
> > +  * For each perf sample (raw->size + 4) needs to be 8 byte aligned,
> > +  * where the 4 corresponds to the 32bit raw->size member that's
> > +  * added to the sample header that userspace sees.
> > +  *
> > +  * Due to the + 4 for the size member: when we copy a report to the
> > +  * userspace facing perf buffer we always copy an additional 4 bytes
> > +  * from the subsequent report to make up for the misalignment, but
> > +  * when a report is at the end of the gpu mapped buffer we need to
> > +  * read 4 bytes past the end of the buffer.
> > +  */
> > + bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
> > + if (bo == NULL) {
> > + DRM_ERROR("Failed to allocate OA buffer\n");
> > + ret = -ENOMEM;
> > + goto err;
> > + }
> > + dev_priv->oa_pmu.oa_buffer.obj = bo;
> > +
> > + ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
> > + if (ret)
> > + goto err_unref;
> > +
> > + /* PreHSW required 512K alignment, HSW requires 16M */
> > + ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
> > + if (ret)
> > + goto err_unref;
> > +
> > + dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
> > + dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
>
> You can look forward to both i915_gem_object_create_internal() and
> i915_gem_object_pin_vmap()

Okay, will do, thanks.

>
> > +
> > + /* Pre-DevBDW: OABUFFER must be set with counters off,
> > +  * before OASTATUS1, but after OASTATUS2 */
> > + I915_WRITE(GEN7_OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> > +GEN7_OASTATUS2_GGTT); /* head */
> > + I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
> > + I915_WRITE(GEN7_OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> > +GEN7_OASTATUS1_OABUFFER_SIZE_16M); /* tail */
> > +
> > + DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
> > +  dev_priv->oa_pmu.oa_buffer.gtt_offset,
> > +  dev_priv->oa_pmu.oa_buffer.addr);
> > +
> > + return 0;
> > +
> > +err_unref:
> > + drm_gem_object_unreference_unlocked(&bo->base);
>
> But what I really want to say was:
> mutex deadlock^^^

Yikes, I've pushed an updated patch addressing this and can reply with a
new patch here in a bit.
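
The gist of it: init_oa_buffer() is entered with struct_mutex already
held, and drm_gem_object_unreference_unlocked() tries to take it again,
so the error path now looks like (sketch):

    err_unref:
            /* struct_mutex is held here, so use the locked variant;
             * the _unlocked one deadlocks re-acquiring the mutex.
             */
            drm_gem_object_unreference(&bo->base);
            dev_priv->oa_pmu.oa_buffer.obj = NULL;
    err:
            return ret;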

Thanks,
- Robert


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre



[Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture

2015-05-07 Thread Chris Wilson
On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> + /* We bypass the default perf core perf_paranoid_cpu() ||
> +  * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
> +  * flag and instead authenticate based on whether the current
> +  * pid owns the specified context, or require CAP_SYS_ADMIN
> +  * when collecting cross-context metrics.
> +  */
> + dev_priv->oa_pmu.specific_ctx = NULL;
> + if (oa_attr.single_context) {
> + u32 ctx_id = oa_attr.ctx_id;
> + unsigned int drm_fd = oa_attr.drm_fd;
> + struct fd fd = fdget(drm_fd);
> +
> + if (fd.file) {

Specify a ctx and not providing the right fd should be its own error,
either EBADF or EINVAL.

> + dev_priv->oa_pmu.specific_ctx =
> + lookup_context(dev_priv, fd.file, ctx_id);
> + }

Missing fdput

> + }
> +
> + if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
> + return -EACCES;
> +
> + mutex_lock(&dev_priv->dev->struct_mutex);

i915_mutex_interruptible, probably best to couple into the GPU error
handling here as well especially as init_oa_buffer() will go onto touch
GPU internals.

> + ret = init_oa_buffer(event);
> + mutex_unlock(&dev_priv->dev->struct_mutex);
> +
> + if (ret)
> + return ret;
> +
> + BUG_ON(dev_priv->oa_pmu.exclusive_event);
> + dev_priv->oa_pmu.exclusive_event = event;
> +
> + event->destroy = i915_oa_event_destroy;
> +
> + /* PRM - observability performance counters:
> +  *
> +  *   OACONTROL, performance counter enable, note:
> +  *
> +  *   "When this bit is set, in order to have coherent counts,
> +  *   RC6 power state and trunk clock gating must be disabled.
> +  *   This can be achieved by programming MMIO registers as
> +  *   0xA094=0 and 0xA090[31]=1"
> +  *
> +  *   In our case we expect that taking pm + FORCEWAKE
> +  *   references will effectively disable RC6 and trunk clock
> +  *   gating.
> +  */
> + intel_runtime_pm_get(dev_priv);
> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);

That is a nuisance. Aside: Why isn't OA inside the powerctx? Is a subset
valid with forcewake? It does perturb the system greatly to disable rc6,
so I wonder if it could be made optional?

> +
> + return 0;
> +}
> +
> +static void update_oacontrol(struct drm_i915_private *dev_priv)
> +{
> + BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
> +
> + if (dev_priv->oa_pmu.event_active) {
> + unsigned long ctx_id = 0;
> + bool pinning_ok = false;
> +
> + if (dev_priv->oa_pmu.specific_ctx) {
> + struct intel_context *ctx =
> + dev_priv->oa_pmu.specific_ctx;
> + struct drm_i915_gem_object *obj =
> + ctx->legacy_hw_ctx.rcs_state;

If only there was ctx->legacy_hw_ctx.rcs_vma...

> +
> + if (i915_gem_obj_is_pinned(obj)) {
> + ctx_id = i915_gem_obj_ggtt_offset(obj);
> + pinning_ok = true;
> + }
> + }
> +
> + if ((ctx_id == 0 || pinning_ok)) {
> + bool periodic = dev_priv->oa_pmu.periodic;
> + u32 period_exponent = dev_priv->oa_pmu.period_exponent;
> + u32 report_format = dev_priv->oa_pmu.oa_buffer.format;
> +
> + I915_WRITE(GEN7_OACONTROL,
> +(ctx_id & GEN7_OACONTROL_CTX_MASK) |
> +(period_exponent <<
> + GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
> +(periodic ?
> + GEN7_OACONTROL_TIMER_ENABLE : 0) |
> +(report_format <<
> + GEN7_OACONTROL_FORMAT_SHIFT) |
> +(ctx_id ?
> + GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
> +GEN7_OACONTROL_ENABLE);

I notice you don't use any write barriers...
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre


[Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture

2015-05-07 Thread Chris Wilson
On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> +static int init_oa_buffer(struct perf_event *event)
> +{
> + struct drm_i915_private *dev_priv =
> + container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
> + struct drm_i915_gem_object *bo;
> + int ret;
> +
> + BUG_ON(!IS_HASWELL(dev_priv->dev));
> + BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> + BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
> +
> + spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
> +
> + /* NB: We over allocate the OA buffer due to the way raw sample data
> +  * gets copied from the gpu mapped circular buffer into the perf
> +  * circular buffer so that only one copy is required.
> +  *
> +  * For each perf sample (raw->size + 4) needs to be 8 byte aligned,
> +  * where the 4 corresponds to the 32bit raw->size member that's
> +  * added to the sample header that userspace sees.
> +  *
> +  * Due to the + 4 for the size member: when we copy a report to the
> +  * userspace facing perf buffer we always copy an additional 4 bytes
> +  * from the subsequent report to make up for the misalignment, but
> +  * when a report is at the end of the gpu mapped buffer we need to
> +  * read 4 bytes past the end of the buffer.
> +  */
> + bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
> + if (bo == NULL) {
> + DRM_ERROR("Failed to allocate OA buffer\n");
> + ret = -ENOMEM;
> + goto err;
> + }
> + dev_priv->oa_pmu.oa_buffer.obj = bo;
> +
> + ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
> + if (ret)
> + goto err_unref;
> +
> + /* PreHSW required 512K alignment, HSW requires 16M */
> + ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
> + if (ret)
> + goto err_unref;
> +
> + dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
> + dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);

You can look forward to both i915_gem_object_create_internal() and
i915_gem_object_pin_vmap()

> +
> + /* Pre-DevBDW: OABUFFER must be set with counters off,
> +  * before OASTATUS1, but after OASTATUS2 */
> + I915_WRITE(GEN7_OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> +GEN7_OASTATUS2_GGTT); /* head */
> + I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
> + I915_WRITE(GEN7_OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> +GEN7_OASTATUS1_OABUFFER_SIZE_16M); /* tail */
> +
> + DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
> +  dev_priv->oa_pmu.oa_buffer.gtt_offset,
> +  dev_priv->oa_pmu.oa_buffer.addr);
> +
> + return 0;
> +
> +err_unref:
> + drm_gem_object_unreference_unlocked(&bo->base);

But what I really want to say was:
mutex deadlock^^^
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre