from:"Lionel Landwerlin"

Re: [PATCH] drm/i915/gt: Report full vm address range

2024-03-14 Thread Lionel Landwerlin


Hi Andi,

In Mesa we've been relying on I915_CONTEXT_PARAM_GTT_SIZE so as long as 
that is adjusted by the kernel, we should be able to continue working 
without issues.


Acked-by: Lionel Landwerlin 

Thanks,

-Lionel

On 13/03/2024 21:39, Andi Shyti wrote:

Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per
vm") has reserved an object for kernel space usage.

Userspace, though, needs to know the full address range.

Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm")
Signed-off-by: Andi Shyti 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
Cc: Michal Mrozek 
Cc: Nirmoy Das 
Cc:  # v6.2+
---
  drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index fa46d2308b0e..d76831f50106 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -982,8 +982,9 @@ static int gen8_init_rsvd(struct i915_address_space *vm)
  
  	vm->rsvd.vma = i915_vma_make_unshrinkable(vma);

vm->rsvd.obj = obj;
-   vm->total -= vma->node.size;
+
return 0;
+
  unref:
i915_gem_object_put(obj);
return ret;

Re: [Intel-gfx] [PATCH 1/2] drm/i915/perf: Subtract gtt_offset from hw_tail

2023-07-18 Thread Lionel Landwerlin


On 18/07/2023 05:43, Ashutosh Dixit wrote:

The code in oa_buffer_check_unlocked() is correct only if the OA buffer is
16 MB aligned (which seems to be the case today in i915). However when the
16 MB alignment is dropped, when we "Subtract partial amount off the tail",
the "& (OA_BUFFER_SIZE - 1)" operation in OA_TAKEN() will result in an
incorrect hw_tail value.

Therefore hw_tail must be brought to the same base as head and read_tail
prior to OA_TAKEN by subtracting gtt_offset from hw_tail.

Signed-off-by: Ashutosh Dixit 
---
  drivers/gpu/drm/i915/i915_perf.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 49c6f1ff11284..f7888a44d1284 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -565,6 +565,7 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
partial_report_size %= report_size;
  
  	/* Subtract partial amount off the tail */

+   hw_tail -= gtt_offset;
hw_tail = OA_TAKEN(hw_tail, partial_report_size);
  
  	/* NB: The head we observe here might effectively be a little



You should squash this patch with the next one. Otherwise further down 
this function there is another


hw_tail -= gtt_offset;


-Lionel

Re: [Intel-gfx] [PATCH] drm/i915: Introduce Wa_14011274333

2023-07-13 Thread Lionel Landwerlin


On 13/07/2023 02:34, Matt Atwood wrote:

Wa_14011274333 applies to RKL, ADL-S, ADL-P and TGL. ALlocate buffer
pinned to GGTT and add WA to restore impacted registers.

v2: use correct lineage number, more generically apply workarounds for
all registers impacted, move workaround to gt/intel_workarounds.c
(MattR)

Based off patch by Tilak Tangudu.

Signed-off-by: Matt Atwood 



I applied this patch to drm-tip and as far as I can tell it doesn't fix 
the problem of the SAMPLER_MODE register loosing its bit0 programming.



-Lionel



---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  5 ++
  drivers/gpu/drm/i915/gt/intel_rc6.c | 63 +
  drivers/gpu/drm/i915/gt/intel_rc6_types.h   |  3 +
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 40 +
  4 files changed, 111 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 718cb2c80f79e..eaee35ecbc8d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -83,6 +83,11 @@
  #define   MTL_MCR_GROUPID REG_GENMASK(11, 8)
  #define   MTL_MCR_INSTANCEID  REG_GENMASK(3, 0)
  
+#define CTX_WA_PTR_MMIO(0x2058)

+#define  CTX_WA_PTR_ADDR_MASK  REG_GENMASK(31,12)
+#define  CTX_WA_TYPE_MASK  REG_GENMASK(4,3)
+#define  CTX_WA_VALID  REG_BIT(0)
+
  #define IPEIR_I965_MMIO(0x2064)
  #define IPEHR_I965_MMIO(0x2068)
  
diff --git a/drivers/gpu/drm/i915/gt/intel_rc6.c b/drivers/gpu/drm/i915/gt/intel_rc6.c

index 58bb1c55294c9..6baa341814da7 100644
--- a/drivers/gpu/drm/i915/gt/intel_rc6.c
+++ b/drivers/gpu/drm/i915/gt/intel_rc6.c
@@ -14,6 +14,7 @@
  #include "intel_gt.h"
  #include "intel_gt_pm.h"
  #include "intel_gt_regs.h"
+#include "intel_gpu_commands.h"
  #include "intel_pcode.h"
  #include "intel_rc6.h"
  
@@ -53,6 +54,65 @@ static struct drm_i915_private *rc6_to_i915(struct intel_rc6 *rc)

return rc6_to_gt(rc)->i915;
  }
  
+static int rc6_wa_bb_ctx_init(struct intel_rc6 *rc6)

+{
+   struct drm_i915_private *i915 = rc6_to_i915(rc6);
+   struct intel_gt *gt = rc6_to_gt(rc6);
+   struct drm_i915_gem_object *obj;
+   struct i915_vma *vma;
+   void *batch;
+   struct i915_gem_ww_ctx ww;
+   int err;
+
+   obj = i915_gem_object_create_shmem(i915, PAGE_SIZE);
+   if (IS_ERR(obj))
+   return PTR_ERR(obj);
+
+   vma = i915_vma_instance(obj, >ggtt->vm, NULL);
+   if (IS_ERR(vma)){
+   err = PTR_ERR(vma);
+   goto err;
+   }
+   rc6->vma=vma;
+   i915_gem_ww_ctx_init(, true);
+retry:
+   err = i915_gem_object_lock(obj, );
+   if (!err)
+   err = i915_ggtt_pin(rc6->vma, , 0, PIN_HIGH);
+   if (err)
+   goto err_ww_fini;
+
+   batch = i915_gem_object_pin_map(obj, I915_MAP_WB);
+   if (IS_ERR(batch)) {
+   err = PTR_ERR(batch);
+   goto err_unpin;
+   }
+   rc6->rc6_wa_bb = batch;
+   return 0;
+err_unpin:
+   if (err)
+   i915_vma_unpin(rc6->vma);
+err_ww_fini:
+   if (err == -EDEADLK) {
+   err = i915_gem_ww_ctx_backoff();
+   if (!err)
+   goto retry;
+   }
+   i915_gem_ww_ctx_fini();
+
+   if (err)
+   i915_vma_put(rc6->vma);
+err:
+   i915_gem_object_put(obj);
+   return err;
+}
+
+void rc6_wa_bb_ctx_wa_fini(struct intel_rc6 *rc6)
+{
+   i915_vma_unpin(rc6->vma);
+   i915_vma_put(rc6->vma);
+}
+
  static void gen11_rc6_enable(struct intel_rc6 *rc6)
  {
struct intel_gt *gt = rc6_to_gt(rc6);
@@ -616,6 +676,9 @@ void intel_rc6_init(struct intel_rc6 *rc6)
err = chv_rc6_init(rc6);
else if (IS_VALLEYVIEW(i915))
err = vlv_rc6_init(rc6);
+   else if ((GRAPHICS_VER_FULL(i915) >= IP_VER(12, 0)) &&
+(GRAPHICS_VER_FULL(i915) <= IP_VER(12, 70)))
+   err = rc6_wa_bb_ctx_init(rc6);
else
err = 0;
  
diff --git a/drivers/gpu/drm/i915/gt/intel_rc6_types.h b/drivers/gpu/drm/i915/gt/intel_rc6_types.h

index cd4587098162a..643fd4e839ad4 100644
--- a/drivers/gpu/drm/i915/gt/intel_rc6_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_rc6_types.h
@@ -33,6 +33,9 @@ struct intel_rc6 {
  
  	struct drm_i915_gem_object *pctx;
  
+	u32 *rc6_wa_bb;

+   struct i915_vma *vma;
+
bool supported : 1;
bool enabled : 1;
bool manual : 1;
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 4d2dece960115..d20afb318d857 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -14,6 +14,7 @@
  #include "intel_gt_regs.h"
  #include "intel_ring.h"
  #include "intel_workarounds.h"
+#include "intel_rc6.h"
  
  /**

Re: [Intel-gfx] [PATCH] drm/i915/perf: Clear out entire reports after reading if not power of 2 size

2023-05-23 Thread Lionel Landwerlin


On 22/05/2023 23:17, Ashutosh Dixit wrote:

Clearing out report id and timestamp as means to detect unlanded reports
only works if report size is power of 2. That is, only when report size is
a sub-multiple of the OA buffer size can we be certain that reports will
land at the same place each time in the OA buffer (after rewind). If report
size is not a power of 2, we need to zero out the entire report to be able
to detect unlanded reports reliably.

Cc: Umesh Nerlige Ramappa 
Signed-off-by: Ashutosh Dixit 


Sad but necessary unfortunately


Reviewed-by:  Lionel Landwerlin 



---
  drivers/gpu/drm/i915/i915_perf.c | 17 +++--
  1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 19d5652300eeb..58284156428dc 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -877,12 +877,17 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
stream->oa_buffer.last_ctx_id = ctx_id;
}
  
-		/*

-* Clear out the report id and timestamp as a means to detect 
unlanded
-* reports.
-*/
-   oa_report_id_clear(stream, report32);
-   oa_timestamp_clear(stream, report32);
+   if (is_power_of_2(report_size)) {
+   /*
+* Clear out the report id and timestamp as a means
+* to detect unlanded reports.
+*/
+   oa_report_id_clear(stream, report32);
+   oa_timestamp_clear(stream, report32);
+   } else {
+   /* Zero out the entire report */
+   memset(report32, 0, report_size);
+   }
}
  
  	if (start_offset != *offset) {

Re: [Intel-gfx] [PATCH] i915/perf: Avoid reading OA reports before they land

2023-05-20 Thread Lionel Landwerlin


Hi Umesh,

Looks like there is still a problem with the if block moving the 
stream->oa_buffer.tail forward.

An application not doing any polling would still run into the same problem.

If I understand correctly this change, it means the time based 
workaround doesn't work.
We need to actually check the report's content before moving the 
software tracked tail.

If that's the case, they maybe we should just delete that code.

-Lionel

On 20/05/2023 01:56, Umesh Nerlige Ramappa wrote:

On DG2, capturing OA reports while running heavy render workloads
sometimes results in invalid OA reports where 64-byte chunks inside
reports have stale values. Under memory pressure, high OA sampling rates
(13.3 us) and heavy render workload, occassionally, the OA HW TAIL
pointer does not progress as fast as the sampling rate. When these
glitches occur, the TAIL pointer takes approx. 200us to progress.  While
this is expected behavior from the HW perspective, invalid reports are
not expected.

In oa_buffer_check_unlocked(), when we execute the if condition, we are
updating the oa_buffer.tail to the aging tail and then setting pollin
based on this tail value, however, we do not have a chance to rewind and
validate the reports prior to setting pollin. The validation happens
in a subsequent call to oa_buffer_check_unlocked(). If a read occurs
before this validation, then we end up reading reports up until this
oa_buffer.tail value which includes invalid reports. Though found on
DG2, this affects all platforms.

Set the pollin only in the else condition in oa_buffer_check_unlocked.

Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7484
Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/7757
Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_perf.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 19d5652300ee..61536e3c4ac9 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -545,7 +545,7 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
int report_size = stream->oa_buffer.format->size;
unsigned long flags;
-   bool pollin;
+   bool pollin = false;
u32 hw_tail;
u64 now;
u32 partial_report_size;
@@ -620,10 +620,10 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
stream->oa_buffer.tail = gtt_offset + tail;
stream->oa_buffer.aging_tail = gtt_offset + hw_tail;
stream->oa_buffer.aging_timestamp = now;
-   }
  
-	pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset,

- stream->oa_buffer.head - gtt_offset) >= report_size;
+   pollin = OA_TAKEN(stream->oa_buffer.tail - gtt_offset,
+ stream->oa_buffer.head - gtt_offset) >= 
report_size;
+   }
  
  	spin_unlock_irqrestore(>oa_buffer.ptr_lock, flags);

Re: [Intel-gfx] [PATCH v8 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-04-27 Thread Lionel Landwerlin


On 27/04/2023 21:19, Teres Alexis, Alan Previn wrote:

(fixed email addresses again - why is my Evolution client deteorating??)

On Thu, 2023-04-27 at 17:18 +, Teres Alexis, Alan Previn wrote:

On Wed, 2023-04-26 at 15:35 -0700, Justen, Jordan L wrote:

On 2023-04-26 11:17:16, Teres Alexis, Alan Previn wrote:

alan:snip

Can you tell that pxp is in progress, but not ready yet, as a separate
state from 'it will never work on this platform'? If so, maybe the
status could return something like:

0: It's never going to work
1: It's ready to use
2: It's starting and should work soon

I could see an argument for treating that as a case where we could
still advertise protected content support, but if we try to use it we
might be in for a nasty delay.


alan: IIRC Lionel seemed okay with any permutation that would allow it to not
get blocked. Daniele did ask for something similiar to what u mentioned above
but he said that is non-blocking. But since both you AND Daniele have mentioned
the same thing, i shall re-rev this and send that change out today.
I notice most GET_PARAMS use -ENODEV for "never gonna work" so I will stick 
with that.
but 1 = ready to use and 2 = starting and should work sounds good. so '0' will 
never
be returned - we just look for a positive value (from user space). I will also
make a PR for mesa side as soon as i get it tested. thanks for reviewing btw.

alan: I also realize with these final touch-ups, we can go back to the original
pxp-context-creation timeout of 250 milisecs like it was on ADL since the user
space component will have this new param to check on (so even farther down from
1 sec on the last couple of revs).

Jordan, Lional - i am thinking of creating the PR on MESA side to take advantage
of GET_PARAM on both get-caps AND runtime creation (latter will be useful to 
ensure
no unnecesssary delay experienced by Mesa stuck in kernel call - which 
practically
never happenned in ADL AFAIK):

1. MESA PXP get caps:
- use GET_PARAM (any positive number shall mean its supported).
2. MESA app-triggered PXP context creation (i.e. if caps was supported):
- use GET_PARAM to wait until positive number switches from "2" to "1".
- now call context creation. So at this point if it fails, we know its
  an actual failure.

you guys okay with above? (i'll re-rev this kernel series first and wait on your
ack or feedback before i create/ test/ submit a PR for Mesa side).



Sounds good.

Thanks,


-Lionel

Re: [Intel-gfx] [PATCH v7 6/8] drm/i915/uapi/pxp: Fix UAPI spec comments and add GET_PARAM for PXP

2023-04-18 Thread Lionel Landwerlin


On 14/04/2023 18:17, Teres Alexis, Alan Previn wrote:

Hi Lionel, does this patch work for you?



Hi, Sorry for the late answer.

That looks good :


Acked-by: Lionel Landwerlin 


Thanks,


-Lionel




On Mon, 2023-04-10 at 10:22 -0700, Ceraolo Spurio, Daniele wrote:

On 4/6/2023 10:44 AM, Alan Previn wrote:

alan:snip


+/*
+ * Query the status of PXP support in i915.
+ *
+ * The query can fail in the following scenarios with the listed error codes:
+ *  -ENODEV = PXP support is not available on the GPU device or in the kernel
+ *due to missing component drivers or kernel configs.
+ * If the IOCTL is successful, the returned parameter will be set to one of the
+ * following values:
+ *   0 = PXP support maybe available but underlying SOC fusing, BIOS or 
firmware
+ *   configuration is unknown and a PXP-context-creation would be required
+ *   for final verification of feature availibility.

Would it be useful to add:

1 = PXP support is available

And start returning that after we've successfully created our first
session? Not sure if userspace would use this though, since they still
need to handle the 0 case anyway.
I'm also ok with this patch as-is, as long as you get an ack from the
userspace drivers for this interface behavior:

Reviewed-by: Daniele Ceraolo Spurio 

Daniele

alan:snip

Re: [Intel-gfx] [PATCH] drm/i915: disable sampler indirect state in bindless heap

2023-04-07 Thread Lionel Landwerlin


On 07/04/2023 12:32, Lionel Landwerlin wrote:

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

BSpec: 46052

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
Reviewed-by: Haridhar Kalvala 



Screwed up the subject-prefix, but this is v4.

Rebased due to another change touching the same register.


-Lionel



---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 19 +++
  2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 492b3de6678d7..fd1f9cd35e9d7 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1145,6 +1145,7 @@
  #define   SC_DISABLE_POWER_OPTIMIZATION_EBB   REG_BIT(9)
  #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSG   REG_BIT(5)
  #define   MTL_DISABLE_SAMPLER_SC_OOO  REG_BIT(3)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
  
  #define GEN9_HALF_SLICE_CHICKEN7		MCR_REG(0xe194)

  #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 6ea453ddd0116..b925ef47304b6 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2971,6 +2971,25 @@ general_render_compute_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_li
  
  	add_render_compute_tuning_settings(i915, wal);
  
+	if (GRAPHICS_VER(i915) >= 11) {

+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*
+* BSpec: 46052
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (IS_MTL_GRAPHICS_STEP(i915, M, STEP_B0, STEP_FOREVER) ||
IS_MTL_GRAPHICS_STEP(i915, P, STEP_B0, STEP_FOREVER))
/* Wa_14017856879 */

[Intel-gfx] [PATCH] drm/i915: disable sampler indirect state in bindless heap

2023-04-07 Thread Lionel Landwerlin

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

BSpec: 46052

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
Reviewed-by: Haridhar Kalvala 
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 492b3de6678d7..fd1f9cd35e9d7 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1145,6 +1145,7 @@
 #define   SC_DISABLE_POWER_OPTIMIZATION_EBBREG_BIT(9)
 #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSGREG_BIT(5)
 #define   MTL_DISABLE_SAMPLER_SC_OOO   REG_BIT(3)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
 
 #define GEN9_HALF_SLICE_CHICKEN7   MCR_REG(0xe194)
 #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA  REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 6ea453ddd0116..b925ef47304b6 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2971,6 +2971,25 @@ general_render_compute_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_li
 
add_render_compute_tuning_settings(i915, wal);
 
+   if (GRAPHICS_VER(i915) >= 11) {
+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*
+* BSpec: 46052
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (IS_MTL_GRAPHICS_STEP(i915, M, STEP_B0, STEP_FOREVER) ||
IS_MTL_GRAPHICS_STEP(i915, P, STEP_B0, STEP_FOREVER))
/* Wa_14017856879 */
-- 
2.34.1

Re: [Intel-gfx] [PATCH 7/7] drm/i915: Allow user to set cache at BO creation

2023-04-05 Thread Lionel Landwerlin

On 04/04/2023 19:04, Yang, Fei wrote:

Subject: Re: [Intel-gfx] [PATCH 7/7] drm/i915: Allow user to set cache at BO 
creation

On 01/04/2023 09:38, fei.y...@intel.com wrote:

From: Fei Yang 

To comply with the design that buffer objects shall have immutable
cache setting through out its life cycle, {set, get}_caching ioctl's
are no longer supported from MTL onward. With that change caching
policy can only be set at object creation time. The current code
applies a default (platform dependent) cache setting for all objects.
However this is not optimal for performance tuning. The patch extends
the existing gem_create uAPI to let user set PAT index for the object
at creation time.
The new extension is platform independent, so UMD's can switch to
using this extension for older platforms as well, while {set,
get}_caching are still supported on these legacy paltforms for compatibility 
reason.

Cc: Chris Wilson 
Cc: Matt Roper 
Signed-off-by: Fei Yang 
Reviewed-by: Andi Shyti 

Just like the protected content uAPI, there is no way for userspace to tell
this feature is available other than trying using it.

Given the issues with protected content, is it not thing we could want to add?

Sorry I'm not aware of the issues with protected content, could you elaborate?
There was a long discussion on teams uAPI channel, could you comment there if
any concerns?

https://teams.microsoft.com/l/message/19:f1767bda6734476ba0a9c7d147b928d1@thread.skype/1675860924675?tenantId=46c98d88-e344-4ed4-8496-4ed7712e255d=379f3ae1-d138-4205-bb65-d4c7d38cb481=1675860924675=GSE%20OSGC=i915%20uAPI%20changes=1675860924675=false

Thanks,
-Fei

We wanted to have a getparam to detect protected support and were told 
to detect it by trying to create a context with it.

Now it appears trying to create a protected context can block for 
several seconds.

Since we have to report capabilities to the user even before it creates 
protected contexts, any app is at risk of blocking.

-Lionel

Thanks,

-Lionel

---
   drivers/gpu/drm/i915/gem/i915_gem_create.c | 33 
   include/uapi/drm/i915_drm.h| 36 ++
   tools/include/uapi/drm/i915_drm.h  | 36 ++
   3 files changed, 105 insertions(+)

Re: [Intel-gfx] [PATCH 7/7] drm/i915: Allow user to set cache at BO creation

2023-04-04 Thread Lionel Landwerlin


On 01/04/2023 09:38, fei.y...@intel.com wrote:

From: Fei Yang 

To comply with the design that buffer objects shall have immutable
cache setting through out its life cycle, {set, get}_caching ioctl's
are no longer supported from MTL onward. With that change caching
policy can only be set at object creation time. The current code
applies a default (platform dependent) cache setting for all objects.
However this is not optimal for performance tuning. The patch extends
the existing gem_create uAPI to let user set PAT index for the object
at creation time.
The new extension is platform independent, so UMD's can switch to using
this extension for older platforms as well, while {set, get}_caching are
still supported on these legacy paltforms for compatibility reason.

Cc: Chris Wilson 
Cc: Matt Roper 
Signed-off-by: Fei Yang 
Reviewed-by: Andi Shyti 



Just like the protected content uAPI, there is no way for userspace to 
tell this feature is available other than trying using it.


Given the issues with protected content, is it not thing we could want 
to add?



Thanks,


-Lionel



---
  drivers/gpu/drm/i915/gem/i915_gem_create.c | 33 
  include/uapi/drm/i915_drm.h| 36 ++
  tools/include/uapi/drm/i915_drm.h  | 36 ++
  3 files changed, 105 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_create.c 
b/drivers/gpu/drm/i915/gem/i915_gem_create.c
index e76c9703680e..1c6e2034d28e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_create.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_create.c
@@ -244,6 +244,7 @@ struct create_ext {
unsigned int n_placements;
unsigned int placement_mask;
unsigned long flags;
+   unsigned int pat_index;
  };
  
  static void repr_placements(char *buf, size_t size,

@@ -393,11 +394,39 @@ static int ext_set_protected(struct i915_user_extension 
__user *base, void *data
return 0;
  }
  
+static int ext_set_pat(struct i915_user_extension __user *base, void *data)

+{
+   struct create_ext *ext_data = data;
+   struct drm_i915_private *i915 = ext_data->i915;
+   struct drm_i915_gem_create_ext_set_pat ext;
+   unsigned int max_pat_index;
+
+   BUILD_BUG_ON(sizeof(struct drm_i915_gem_create_ext_set_pat) !=
+offsetofend(struct drm_i915_gem_create_ext_set_pat, rsvd));
+
+   if (copy_from_user(, base, sizeof(ext)))
+   return -EFAULT;
+
+   max_pat_index = INTEL_INFO(i915)->max_pat_index;
+
+   if (ext.pat_index > max_pat_index) {
+   drm_dbg(>drm, "PAT index is invalid: %u\n",
+   ext.pat_index);
+   return -EINVAL;
+   }
+
+   ext_data->pat_index = ext.pat_index;
+
+   return 0;
+}
+
  static const i915_user_extension_fn create_extensions[] = {
[I915_GEM_CREATE_EXT_MEMORY_REGIONS] = ext_set_placements,
[I915_GEM_CREATE_EXT_PROTECTED_CONTENT] = ext_set_protected,
+   [I915_GEM_CREATE_EXT_SET_PAT] = ext_set_pat,
  };
  
+#define PAT_INDEX_NOT_SET	0x

  /**
   * Creates a new mm object and returns a handle to it.
   * @dev: drm device pointer
@@ -417,6 +446,7 @@ i915_gem_create_ext_ioctl(struct drm_device *dev, void 
*data,
if (args->flags & ~I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS)
return -EINVAL;
  
+	ext_data.pat_index = PAT_INDEX_NOT_SET;

ret = i915_user_extensions(u64_to_user_ptr(args->extensions),
   create_extensions,
   ARRAY_SIZE(create_extensions),
@@ -453,5 +483,8 @@ i915_gem_create_ext_ioctl(struct drm_device *dev, void 
*data,
if (IS_ERR(obj))
return PTR_ERR(obj);
  
+	if (ext_data.pat_index != PAT_INDEX_NOT_SET)

+   i915_gem_object_set_pat_index(obj, ext_data.pat_index);
+
return i915_gem_publish(obj, file, >size, >handle);
  }
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index dba7c5a5b25e..03c5c314846e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -3630,9 +3630,13 @@ struct drm_i915_gem_create_ext {
 *
 * For I915_GEM_CREATE_EXT_PROTECTED_CONTENT usage see
 * struct drm_i915_gem_create_ext_protected_content.
+*
+* For I915_GEM_CREATE_EXT_SET_PAT usage see
+* struct drm_i915_gem_create_ext_set_pat.
 */
  #define I915_GEM_CREATE_EXT_MEMORY_REGIONS 0
  #define I915_GEM_CREATE_EXT_PROTECTED_CONTENT 1
+#define I915_GEM_CREATE_EXT_SET_PAT 2
__u64 extensions;
  };
  
@@ -3747,6 +3751,38 @@ struct drm_i915_gem_create_ext_protected_content {

__u32 flags;
  };
  
+/**

+ * struct drm_i915_gem_create_ext_set_pat - The
+ * I915_GEM_CREATE_EXT_SET_PAT extension.
+ *
+ * If this extension is provided, the specified caching policy (PAT index) is
+ * applied to the buffer object.
+ *
+ * Below is an example on how to create an object with specific caching

Re: [Intel-gfx] [PATCH] drm/i915: disable sampler indirect state in bindless heap

2023-04-03 Thread Lionel Landwerlin


On 03/04/2023 21:22, Kalvala, Haridhar wrote:


On 3/31/2023 12:35 PM, Kalvala, Haridhar wrote:


On 3/30/2023 10:49 PM, Lionel Landwerlin wrote:

On 29/03/2023 01:49, Matt Atwood wrote:

On Tue, Mar 28, 2023 at 04:14:33PM +0530, Kalvala, Haridhar wrote:

On 3/9/2023 8:56 PM, Lionel Landwerlin wrote:
By default the indirect state sampler data (border colors) are 
stored
in the same heap as the SAMPLER_STATE structure. For userspace 
drivers

that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is 
reproducing

the behavior of the Windows drivers.


Bspec:46052



Sorry, missed your answer.


Should I just add the Bspec number to the commit message ?


Thanks,


-Lionel



Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
   drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
   drivers/gpu/drm/i915/gt/intel_workarounds.c | 17 
+

   2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h

index 08d76aa06974c..1aaa471d08c56 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1141,6 +1141,7 @@
   #define   ENABLE_SMALLPL    REG_BIT(15)
   #define   SC_DISABLE_POWER_OPTIMIZATION_EBB REG_BIT(9)
   #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSG REG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE REG_BIT(0)
   #define GEN9_HALF_SLICE_CHICKEN7 MCR_REG(0xe194)
   #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c

index 32aa1647721ae..734b64e714647 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2542,6 +2542,23 @@ rcs_engine_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_list *wal)

    ENABLE_SMALLPL);
   }
+    if (GRAPHICS_VER(i915) >= 11) {

Hi Lionel,

Not sure should this implementation be part of 
"rcs_engine_wa_init" or

"general_render_compute_wa_init" ?



I checked with Matt Ropper as well, looks like this implementation 
should be part of "general_render_compute_wa_init".



I did send a v3 of the patch last Thursday to address this.

Let me know if that's good.


Thanks,


-Lionel







+    /* This is not a Wa (although referred to as
+ * WaSetInidrectStateOverride in places), this allows
+ * applications that reference sampler states through
+ * the BindlessSamplerStateBaseAddress to have their
+ * border color relative to DynamicStateBaseAddress
+ * rather than BindlessSamplerStateBaseAddress.
+ *
+ * Otherwise SAMPLER_STATE border colors have to be
+ * copied in multiple heaps (DynamicStateBaseAddress &
+ * BindlessSamplerStateBaseAddress)
+ */
+    wa_mcr_masked_en(wal,
+ GEN10_SAMPLER_MODE,
  since we checking the condition for GEN11 or above, can this 
register be

defined as GEN11_SAMPLER_MODE
We use the name of the first time the register was introduced, gen 
10 is

fine here.

ok

+ GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+    }
+
   if (GRAPHICS_VER(i915) == 11) {
   /* This is not an Wa. Enable for better image quality */
   wa_masked_en(wal,

--
Regards,
Haridhar Kalvala


Regards,
MattA

Re: [Intel-gfx] [v2] drm/i915: disable sampler indirect state in bindless heap

2023-03-30 Thread Lionel Landwerlin


On 30/03/2023 22:38, Matt Atwood wrote:

On Thu, Mar 30, 2023 at 12:27:33PM -0700, Matt Atwood wrote:

On Thu, Mar 30, 2023 at 08:47:40PM +0300, Lionel Landwerlin wrote:

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they

   convenient

only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

BSpec: 46052


Assuming still good CI results..
Reviewed-by: Matt Atwood 

My mistake version 3 required. comments inline.

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 19 +++
  2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 4aecb5a7b6318..f298dc461a72f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1144,6 +1144,7 @@
  #define   ENABLE_SMALLPL  REG_BIT(15)
  #define   SC_DISABLE_POWER_OPTIMIZATION_EBB   REG_BIT(9)
  #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSG   REG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
  
  #define GEN9_HALF_SLICE_CHICKEN7		MCR_REG(0xe194)

  #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index e7ee24bcad893..0ce1c8c23c631 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2535,6 +2535,25 @@ rcs_engine_wa_init(struct intel_engine_cs *engine, 
struct i915_wa_list *wal)
 ENABLE_SMALLPL);
}
  

This workaround belongs in general render workarounds not rcs, as per
the address space in i915_regs.h 0x2xxx.

#define RENDER_RING_BASE0x02000



Thanks makes sense.


-Lionel






+   if (GRAPHICS_VER(i915) >= 11) {
+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*
+* BSpec: 46052
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (GRAPHICS_VER(i915) == 11) {
/* This is not an Wa. Enable for better image quality */
wa_masked_en(wal,
--
2.34.1


MattA

[Intel-gfx] [v3] drm/i915: disable sampler indirect state in bindless heap

2023-03-30 Thread Lionel Landwerlin

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

BSpec: 46052

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 4aecb5a7b6318..f298dc461a72f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1144,6 +1144,7 @@
 #define   ENABLE_SMALLPL   REG_BIT(15)
 #define   SC_DISABLE_POWER_OPTIMIZATION_EBBREG_BIT(9)
 #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSGREG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
 
 #define GEN9_HALF_SLICE_CHICKEN7   MCR_REG(0xe194)
 #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA  REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index e7ee24bcad893..5bfc864d5fcc0 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2971,6 +2971,25 @@ general_render_compute_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_li
 
add_render_compute_tuning_settings(i915, wal);
 
+   if (GRAPHICS_VER(i915) >= 11) {
+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*
+* BSpec: 46052
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (IS_MTL_GRAPHICS_STEP(i915, M, STEP_A0, STEP_B0) ||
IS_MTL_GRAPHICS_STEP(i915, P, STEP_A0, STEP_B0) ||
IS_DG2_GRAPHICS_STEP(i915, G10, STEP_B0, STEP_FOREVER) ||
-- 
2.34.1

[Intel-gfx] [v2] drm/i915: disable sampler indirect state in bindless heap

2023-03-30 Thread Lionel Landwerlin

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

BSpec: 46052

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 4aecb5a7b6318..f298dc461a72f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1144,6 +1144,7 @@
 #define   ENABLE_SMALLPL   REG_BIT(15)
 #define   SC_DISABLE_POWER_OPTIMIZATION_EBBREG_BIT(9)
 #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSGREG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
 
 #define GEN9_HALF_SLICE_CHICKEN7   MCR_REG(0xe194)
 #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA  REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index e7ee24bcad893..0ce1c8c23c631 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2535,6 +2535,25 @@ rcs_engine_wa_init(struct intel_engine_cs *engine, 
struct i915_wa_list *wal)
 ENABLE_SMALLPL);
}
 
+   if (GRAPHICS_VER(i915) >= 11) {
+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*
+* BSpec: 46052
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (GRAPHICS_VER(i915) == 11) {
/* This is not an Wa. Enable for better image quality */
wa_masked_en(wal,
-- 
2.34.1

Re: [Intel-gfx] [PATCH] drm/i915: disable sampler indirect state in bindless heap

2023-03-30 Thread Lionel Landwerlin


On 29/03/2023 01:49, Matt Atwood wrote:

On Tue, Mar 28, 2023 at 04:14:33PM +0530, Kalvala, Haridhar wrote:

On 3/9/2023 8:56 PM, Lionel Landwerlin wrote:

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.


Bspec:46052



Sorry, missed your answer.


Should I just add the Bspec number to the commit message ?


Thanks,


-Lionel



Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
   drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
   drivers/gpu/drm/i915/gt/intel_workarounds.c | 17 +
   2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 08d76aa06974c..1aaa471d08c56 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1141,6 +1141,7 @@
   #define   ENABLE_SMALLPL REG_BIT(15)
   #define   SC_DISABLE_POWER_OPTIMIZATION_EBB  REG_BIT(9)
   #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSG  REG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
   #define GEN9_HALF_SLICE_CHICKEN7 MCR_REG(0xe194)
   #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLAREG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 32aa1647721ae..734b64e714647 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2542,6 +2542,23 @@ rcs_engine_wa_init(struct intel_engine_cs *engine, 
struct i915_wa_list *wal)
 ENABLE_SMALLPL);
}
+   if (GRAPHICS_VER(i915) >= 11) {

Hi Lionel,

Not sure should this implementation be part of "rcs_engine_wa_init" or
"general_render_compute_wa_init".


+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,

  since we checking the condition for GEN11 or above, can this register be
defined as GEN11_SAMPLER_MODE

We use the name of the first time the register was introduced, gen 10 is
fine here.

+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (GRAPHICS_VER(i915) == 11) {
/* This is not an Wa. Enable for better image quality */
wa_masked_en(wal,

--
Regards,
Haridhar Kalvala


Regards,
MattA

Re: [Intel-gfx] [PATCH v6 5/8] drm/i915/pxp: Add ARB session creation and cleanup

2023-03-27 Thread Lionel Landwerlin


On 26/03/2023 14:18, Rodrigo Vivi wrote:

On Sat, Mar 25, 2023 at 02:19:21AM -0400, Teres Alexis, Alan Previn wrote:

alan:snip

@@ -353,8 +367,20 @@ int intel_pxp_start(struct intel_pxp *pxp)
alan:snip

+   if (HAS_ENGINE(pxp->ctrl_gt, GSC0)) {
+   /*
+* GSC-fw loading, GSC-proxy init (requiring an mei component 
driver) and
+* HuC-fw loading must all occur first before we start 
requesting for PXP
+* sessions. Checking HuC authentication (the last dependency)  
will suffice.
+* Let's use a much larger 8 second timeout considering all the 
types of
+* dependencies prior to that.
+*/
+   if (wait_for(intel_huc_is_authenticated(>ctrl_gt->uc.huc), 
8000))

This big timeout needs an ack from userspace drivers, as intel_pxp_start
is called during context creation and the current way to query if the
feature is supported is to create a protected context. Unfortunately, we
do need to wait to confirm that PXP is available (although in most cases
it shouldn't take even close to 8 secs), because until everything is
setup we're not sure if things will work as expected. I see 2 potential
mitigations in case the timeout doesn't work as-is:

1) we return -EAGAIN (or another dedicated error code) to userspace if
the prerequisite steps aren't done yet. This would indicate that the
feature is there, but that we haven't completed the setup yet. The
caller can then decide if they want to retry immediately or later. Pro:
more flexibility for userspace; Cons: new interface return code.

2) we add a getparam to say if PXP is supported in HW and the support is
compiled in i915. Userspace can query this as a way to check the feature
support and only create the context if they actually need it for PXP
operations. Pro: simpler kernel implementation; Cons: new getparam, plus
even if the getparam returns true the pxp_start could later fail, so
userspace needs to handle that case.


alan: I've cc'd Rodrigo, Joonas and Lionel. Folks - what are your thoughts on 
above issue?
Recap: On MTL, only when creating a GEM Protected (PXP) context for the very 
first time after
a driver load, it will be dependent on (1) loading the GSC firmware, (2) GuC 
loading the HuC
firmware and (3) GSC authenticating the HuC fw. But step 3 also depends on 
additional
GSC-proxy-init steps that depend on a new mei-gsc-proxy component driver. I'd 
used the
8 second number based on offline conversations with Daniele but that is a 
worse-case.
Alternatively, should we change UAPI instead to return -EAGAIN as per Daniele's 
proposal?
I believe we've had the get-param conversation offline recently and the 
direction was to
stick with attempting to create the context as it is normal in 3D UMD when it 
comes to
testing capabilities for other features too.

Thoughts?

I like the option 1 more. This extra return handling won't break compatibility.



I like option 2 better because we have to report support as fast as we 
can when enumerating devices on the system for example.


If I understand correctly, with the get param, most apps won't ever be 
blocking on any PXP stuff if they don't use it.


Only the ones that require protected support might block.


-Lionel

[Intel-gfx] [PATCH] drm/i915: disable sampler indirect state in bindless heap

2023-03-09 Thread Lionel Landwerlin

By default the indirect state sampler data (border colors) are stored
in the same heap as the SAMPLER_STATE structure. For userspace drivers
that can be 2 different heaps (dynamic state heap & bindless sampler
state heap). This means that border colors have to copied in 2
different places so that the same SAMPLER_STATE structure find the
right data.

This change is forcing the indirect state sampler data to only be in
the dynamic state pool (more convinient for userspace drivers, they
only have to have one copy of the border colors). This is reproducing
the behavior of the Windows drivers.

Signed-off-by: Lionel Landwerlin 
Cc: sta...@vger.kernel.org
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 17 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 08d76aa06974c..1aaa471d08c56 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1141,6 +1141,7 @@
 #define   ENABLE_SMALLPL   REG_BIT(15)
 #define   SC_DISABLE_POWER_OPTIMIZATION_EBBREG_BIT(9)
 #define   GEN11_SAMPLER_ENABLE_HEADLESS_MSGREG_BIT(5)
+#define   GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE  REG_BIT(0)
 
 #define GEN9_HALF_SLICE_CHICKEN7   MCR_REG(0xe194)
 #define   DG2_DISABLE_ROUND_ENABLE_ALLOW_FOR_SSLA  REG_BIT(15)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 32aa1647721ae..734b64e714647 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2542,6 +2542,23 @@ rcs_engine_wa_init(struct intel_engine_cs *engine, 
struct i915_wa_list *wal)
 ENABLE_SMALLPL);
}
 
+   if (GRAPHICS_VER(i915) >= 11) {
+   /* This is not a Wa (although referred to as
+* WaSetInidrectStateOverride in places), this allows
+* applications that reference sampler states through
+* the BindlessSamplerStateBaseAddress to have their
+* border color relative to DynamicStateBaseAddress
+* rather than BindlessSamplerStateBaseAddress.
+*
+* Otherwise SAMPLER_STATE border colors have to be
+* copied in multiple heaps (DynamicStateBaseAddress &
+* BindlessSamplerStateBaseAddress)
+*/
+   wa_mcr_masked_en(wal,
+GEN10_SAMPLER_MODE,
+GEN11_INDIRECT_STATE_BASE_ADDR_OVERRIDE);
+   }
+
if (GRAPHICS_VER(i915) == 11) {
/* This is not an Wa. Enable for better image quality */
wa_masked_en(wal,
-- 
2.34.1

Re: [Intel-gfx] [PATCH 0/6] drm/i915: Fix up and test RING_TIMESTAMP on gen4-6

2022-10-31 Thread Lionel Landwerlin


On 31/10/2022 15:56, Ville Syrjala wrote:

From: Ville Syrjälä 

Correct the ring timestamp frequency for gen4/5, and run the
relevant selftests for gen4-6.

I've posted at least most of this before, but stuff changed
in the meantinme so it needed a rebase.

Ville Syrjälä (6):
   drm/i915: Fix cs timestamp frequency for ctg/elk/ilk
   drm/i915: Stop claiming cs timestamp frquency on gen2/3
   drm/i915: Fix cs timestamp frequency for cl/bw
   drm/i915/selftests: Run MI_BB perf selftests on SNB
   drm/i915/selftests: Test RING_TIMESTAMP on gen4/5
   drm/i915/selftests: Run the perf MI_BB tests on gen4/5

  .../gpu/drm/i915/gt/intel_gt_clock_utils.c| 38 ---
  drivers/gpu/drm/i915/gt/selftest_engine_cs.c  | 22 +--
  drivers/gpu/drm/i915/gt/selftest_gt_pm.c  | 36 --
  3 files changed, 67 insertions(+), 29 deletions(-)


Reviewed-by: Lionel Landwerlin

Re: [Intel-gfx] [PATCH v6 00/16] Add DG2 OA support

2022-10-27 Thread Lionel Landwerlin


Thanks Umesh,

Is it looking good to land?
Looking forward to have this in Mesa upstream.

-Lionel

On 27/10/2022 01:20, Umesh Nerlige Ramappa wrote:

Add OA format support for DG2 and various fixes for DG2.

This series has 2 uapi changes listed below:

1) drm/i915/perf: Add OAG and OAR formats for DG2

DG2 has new OA formats defined that can be selected by the
user. The UMD changes that are consumed by GPUvis are:
https://patchwork.freedesktop.org/patch/504456/?series=107633=5

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18893

2) drm/i915/perf: Apply Wa_18013179988

DG2 has a bug where the OA timestamp does not tick at the CS timestamp
frequency. Instead it ticks at a multiple that is determined from the
CTC_SHIFT value in RPM_CONFIG. Since the timestamp is used by UMD to
make sense of all the counters in the report, expose the OA timestamp
frequency to the user. The interface is generic and applies to all
platforms. On platforms where the bug is not present, this returns the
CS timestamp frequency. UMD specific changes consumed by GPUvis are:
https://patchwork.freedesktop.org/patch/504464/?series=107633=5

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18893

v2:
- Add review comments
- Update uapi changes in cover letter
- Drop patches for non-production platforms
drm/i915/perf: Use helpers to process reports w.r.t. OA buffer size
drm/i915/perf: Add Wa_16010703925:dg2

- Drop 64-bit OA format changes for now
drm/i915/perf: Parse 64bit report header formats correctly
drm/i915/perf: Add Wa_1608133521:dg2

v3:
- Add review comments to patches 02, 04, 05, 14
- Drop Acks

v4:
- Add review comments to patch 04
- Update R-bs
- Add MR links to patches 02 and 12

v5:
- Drop unrelated comment
- Rebase and fix MCR reg write
- On pre-gen12, EU flex config is saved/restored in the context image, so
   save/restore EU flex config only for gen12.

v6:
- Fix checkpatch issues

Test-with: 20221025200709.83314-1-umesh.nerlige.rama...@intel.com
Signed-off-by: Umesh Nerlige Ramappa 

Lionel Landwerlin (1):
   drm/i915/perf: complete programming whitelisting for XEHPSDV

Umesh Nerlige Ramappa (14):
   drm/i915/perf: Fix OA filtering logic for GuC mode
   drm/i915/perf: Add 32-bit OAG and OAR formats for DG2
   drm/i915/perf: Fix noa wait predication for DG2
   drm/i915/perf: Determine gen12 oa ctx offset at runtime
   drm/i915/perf: Enable bytes per clock reporting in OA
   drm/i915/perf: Simply use stream->ctx
   drm/i915/perf: Move gt-specific data from i915->perf to gt->perf
   drm/i915/perf: Replace gt->perf.lock with stream->lock for file ops
   drm/i915/perf: Use gt-specific ggtt for OA and noa-wait buffers
   drm/i915/perf: Store a pointer to oa_format in oa_buffer
   drm/i915/perf: Add Wa_1508761755:dg2
   drm/i915/perf: Apply Wa_18013179988
   drm/i915/perf: Save/restore EU flex counters across reset
   drm/i915/perf: Enable OA for DG2

Vinay Belgaumkar (1):
   drm/i915/guc: Support OA when Wa_16011777198 is enabled

  drivers/gpu/drm/i915/gt/intel_engine_regs.h   |   1 +
  drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |   4 +
  drivers/gpu/drm/i915/gt/intel_gt_regs.h   |   1 +
  drivers/gpu/drm/i915/gt/intel_gt_types.h  |   3 +
  drivers/gpu/drm/i915/gt/intel_lrc.h   |   2 +
  drivers/gpu/drm/i915/gt/intel_sseu.c  |   4 +-
  .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h |   9 +
  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  10 +
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   |  66 ++
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |   2 +
  drivers/gpu/drm/i915/i915_drv.h   |   5 +
  drivers/gpu/drm/i915/i915_getparam.c  |   3 +
  drivers/gpu/drm/i915/i915_pci.c   |   2 +
  drivers/gpu/drm/i915/i915_perf.c  | 576 ++
  drivers/gpu/drm/i915/i915_perf.h  |   2 +
  drivers/gpu/drm/i915/i915_perf_oa_regs.h  |   6 +-
  drivers/gpu/drm/i915/i915_perf_types.h|  47 +-
  drivers/gpu/drm/i915/intel_device_info.h  |   2 +
  drivers/gpu/drm/i915/selftests/i915_perf.c|  16 +-
  include/uapi/drm/i915_drm.h   |  10 +
  20 files changed, 630 insertions(+), 141 deletions(-)

Re: [Intel-gfx] [PATCH 01/19] drm/i915/perf: Fix OA filtering logic for GuC mode

2022-09-22 Thread Lionel Landwerlin


On 15/09/2022 02:13, Umesh Nerlige Ramappa wrote:

On Wed, Sep 14, 2022 at 03:26:15PM -0700, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 09:39:33PM +0300, Lionel Landwerlin wrote:

On 06/09/2022 20:39, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 05:33:00PM +0300, Lionel Landwerlin wrote:

On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:
With GuC mode of submission, GuC is in control of defining the 
context id field
that is part of the OA reports. To filter reports, UMD and KMD 
must know what sw
context id was chosen by GuC. There is not interface between KMD 
and GuC to
determine this, so read the upper-dword of EXECLIST_STATUS to 
filter/squash OA

reports for the specific context.

Signed-off-by: Umesh Nerlige Ramappa 




I assume you checked with GuC that this doesn't change as the 
context is running?


Correct.



With i915/execlist submission mode, we had to ask i915 to pin the 
sw_id/ctx_id.




From GuC perspective, the context id can change once KMD 
de-registers the context and that will not happen while the context 
is in use.


Thanks,
Umesh



Thanks Umesh,


Maybe I should have been more precise in my question :


Can the ID change while the i915-perf stream is opened?

Because the ID not changing while the context is running makes sense.

But since the number of available IDs is limited to 2k or something 
on Gfx12, it's possible the GuC has to reuse IDs if too many apps 
want to run during the period of time while i915-perf is active and 
filtering.




available guc ids are 64k with 4k reserved for multi-lrc, so GuC may 
have to reuse ids once 60k ids are used up.


Spoke to the GuC team again and if there are a lot of contexts (> 60K) 
running, there is a possibility of the context id being recycled. In 
that case, the capture would be broken. I would track this as a 
separate JIRA and follow up on a solution.


From OA use case perspective, are we interested in monitoring just one 
hardware context? If we make sure this context is not stolen, are we 
good?

Thanks,
Umesh



Yep, we only care about that one ID not changing.


Thanks,

-Lionel






Thanks,
Umesh



-Lionel






If that's not the case then filtering is broken.


-Lionel



---
 drivers/gpu/drm/i915/gt/intel_lrc.h |   2 +
 drivers/gpu/drm/i915/i915_perf.c    | 141 


 2 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h 
b/drivers/gpu/drm/i915/gt/intel_lrc.h

index a390f0813c8b..7111bae759f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.h
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
@@ -110,6 +110,8 @@ enum {
 #define XEHP_SW_CTX_ID_WIDTH    16
 #define XEHP_SW_COUNTER_SHIFT    58
 #define XEHP_SW_COUNTER_WIDTH    6
+#define GEN12_GUC_SW_CTX_ID_SHIFT    39
+#define GEN12_GUC_SW_CTX_ID_WIDTH    16
 static inline void lrc_runtime_start(struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c

index f3c23fe9ad9c..735244a3aedd 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1233,6 +1233,125 @@ static struct intel_context 
*oa_pin_context(struct i915_perf_stream *stream)

 return stream->pinned_ctx;
 }
+static int
+__store_reg_to_mem(struct i915_request *rq, i915_reg_t reg, u32 
ggtt_offset)

+{
+    u32 *cs, cmd;
+
+    cmd = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+    if (GRAPHICS_VER(rq->engine->i915) >= 8)
+    cmd++;
+
+    cs = intel_ring_begin(rq, 4);
+    if (IS_ERR(cs))
+    return PTR_ERR(cs);
+
+    *cs++ = cmd;
+    *cs++ = i915_mmio_reg_offset(reg);
+    *cs++ = ggtt_offset;
+    *cs++ = 0;
+
+    intel_ring_advance(rq, cs);
+
+    return 0;
+}
+
+static int
+__read_reg(struct intel_context *ce, i915_reg_t reg, u32 
ggtt_offset)

+{
+    struct i915_request *rq;
+    int err;
+
+    rq = i915_request_create(ce);
+    if (IS_ERR(rq))
+    return PTR_ERR(rq);
+
+    i915_request_get(rq);
+
+    err = __store_reg_to_mem(rq, reg, ggtt_offset);
+
+    i915_request_add(rq);
+    if (!err && i915_request_wait(rq, 0, HZ / 2) < 0)
+    err = -ETIME;
+
+    i915_request_put(rq);
+
+    return err;
+}
+
+static int
+gen12_guc_sw_ctx_id(struct intel_context *ce, u32 *ctx_id)
+{
+    struct i915_vma *scratch;
+    u32 *val;
+    int err;
+
+    scratch = 
__vm_create_scratch_for_read_pinned(>engine->gt->ggtt->vm, 4);

+    if (IS_ERR(scratch))
+    return PTR_ERR(scratch);
+
+    err = i915_vma_sync(scratch);
+    if (err)
+    goto err_scratch;
+
+    err = __read_reg(ce, 
RING_EXECLIST_STATUS_HI(ce->engine->mmio_base),

+ i915_ggtt_offset(scratch));
+    if (err)
+    goto err_scratch;
+
+    val = i915_gem_object_pin_map_unlocked(scratch->obj, 
I915_MAP_WB);

+    if (IS_ERR(val)) {
+    err = PTR_ERR(val);
+    goto err_scratch;
+    }
+
+    *ctx_id = *val;
+    i915_gem

Re: [Intel-gfx] [PATCH 04/19] drm/i915/perf: Determine gen12 oa ctx offset at runtime

2022-09-08 Thread Lionel Landwerlin


On 06/09/2022 23:35, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 10:48:50PM +0300, Lionel Landwerlin wrote:

On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Some SKUs of same gen12 platform may have different oactxctrl
offsets. For gen12, determine oactxctrl offsets at runtime.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/i915_perf.c | 149 ++-
 drivers/gpu/drm/i915/i915_perf_oa_regs.h |   2 +-
 2 files changed, 120 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c

index 3526693d64fa..efa7eda83edd 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1363,6 +1363,67 @@ static int gen12_get_render_context_id(struct 
i915_perf_stream *stream)

 return 0;
 }
+#define MI_OPCODE(x) (((x) >> 23) & 0x3f)
+#define IS_MI_LRI_CMD(x) (MI_OPCODE(x) == MI_OPCODE(MI_INSTR(0x22, 
0)))

+#define MI_LRI_LEN(x) (((x) & 0xff) + 1)



Maybe you want to put this in intel_gpu_commands.h



+#define __valid_oactxctrl_offset(x) ((x) && (x) != U32_MAX)
+static bool __find_reg_in_lri(u32 *state, u32 reg, u32 *offset)
+{
+    u32 idx = *offset;
+    u32 len = MI_LRI_LEN(state[idx]) + idx;
+
+    idx++;
+    for (; idx < len; idx += 2)
+    if (state[idx] == reg)
+    break;
+
+    *offset = idx;
+    return state[idx] == reg;
+}
+
+static u32 __context_image_offset(struct intel_context *ce, u32 reg)
+{
+    u32 offset, len = (ce->engine->context_size - PAGE_SIZE) / 4;
+    u32 *state = ce->lrc_reg_state;
+
+    for (offset = 0; offset < len; ) {
+    if (IS_MI_LRI_CMD(state[offset])) {


I'm a bit concerned you might find other matches with this.

Because let's say you run into a 3DSTATE_SUBSLICE_HASH_TABLE 
instruction, you'll iterate the instruction dword by dword because 
you don't know how to read its length and skip to the next one.


Now some of the fields can be programmed from userspace to look like 
an MI_LRI header, so you start to read data in the wrong way.



Unfortunately I don't have a better solution. My only ask is that you 
make __find_reg_in_lri() take the context image size in parameter so 
it NEVER goes over the the context image.



To limit the risk you should run this function only one at driver 
initialization and store the found offset.




Hmm, didn't know that there may be non-LRI commands in the context 
image or user could add to the context image somehow. Does using the 
context image size alone address these issues?


Even after including the size in the logic, any reason you think we 
would be much more safer to do this from init? Is it because context 
image is not touched by user yet?



The format of the image (commands in there and their offset) is fixed 
per HW generation.


Only the date in each of the commands will vary per context.

In the case of MI_LRI, the register offsets are the same for all 
context, but the value programmed will vary per context.


So executing once should be enough to find the right offset, rather 
than  every time we open the i915-perf stream.



I think once you have the logic to make sure you never read outside the 
image it should be alright.



-Lionel




Thanks,
Umesh



Thanks,


-Lionel



+    if (__find_reg_in_lri(state, reg, ))
+    break;
+    } else {
+    offset++;
+    }
+    }
+
+    return offset < len ? offset : U32_MAX;
+}
+
+static int __set_oa_ctx_ctrl_offset(struct intel_context *ce)
+{
+    i915_reg_t reg = GEN12_OACTXCONTROL(ce->engine->mmio_base);
+    struct i915_perf *perf = >engine->i915->perf;
+    u32 saved_offset = perf->ctx_oactxctrl_offset;
+    u32 offset;
+
+    /* Do this only once. Failure is stored as offset of U32_MAX */
+    if (saved_offset)
+    return 0;
+
+    offset = __context_image_offset(ce, i915_mmio_reg_offset(reg));
+    perf->ctx_oactxctrl_offset = offset;
+
+    drm_dbg(>engine->i915->drm,
+    "%s oa ctx control at 0x%08x dword offset\n",
+    ce->engine->name, offset);
+
+    return __valid_oactxctrl_offset(offset) ? 0 : -ENODEV;
+}
+
+static bool engine_supports_mi_query(struct intel_engine_cs *engine)
+{
+    return engine->class == RENDER_CLASS;
+}
+
 /**
  * oa_get_render_ctx_id - determine and hold ctx hw id
  * @stream: An i915-perf stream opened for OA metrics
@@ -1382,6 +1443,17 @@ static int oa_get_render_ctx_id(struct 
i915_perf_stream *stream)

 if (IS_ERR(ce))
 return PTR_ERR(ce);
+    if (engine_supports_mi_query(stream->engine)) {
+    ret = __set_oa_ctx_ctrl_offset(ce);
+    if (ret && !(stream->sample_flags & SAMPLE_OA_REPORT)) {
+    intel_context_unpin(ce);
+    drm_err(>perf->i915->drm,
+    "Enabling perf query failed for %s\n",
+    stream->engine->name);
+    return r

Re: [Intel-gfx] [PATCH 10/19] drm/i915/perf: Use gt-specific ggtt for OA and noa-wait buffers

2022-09-06 Thread Lionel Landwerlin


On 06/09/2022 23:28, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 10:56:13PM +0300, Lionel Landwerlin wrote:

On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:
User passes uabi engine class and instance to the perf OA interface. 
Use

gt corresponding to the engine to pin the buffers to the right ggtt.

Signed-off-by: Umesh Nerlige Ramappa 


I didn't know there was a GGTT per engine.

Do I understand this correct?


No, GGTT is still per-gt. We just derive the gt from engine class 
instance passed (as in engine->gt).



Oh thanks I understand now.


Reviewed-by: Lionel Landwerlin 







Thanks,

-Lionel



---
 drivers/gpu/drm/i915/i915_perf.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c

index 87b92d2946f4..f7621b45966c 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1765,6 +1765,7 @@ static void gen12_init_oa_buffer(struct 
i915_perf_stream *stream)

 static int alloc_oa_buffer(struct i915_perf_stream *stream)
 {
 struct drm_i915_private *i915 = stream->perf->i915;
+    struct intel_gt *gt = stream->engine->gt;
 struct drm_i915_gem_object *bo;
 struct i915_vma *vma;
 int ret;
@@ -1784,11 +1785,22 @@ static int alloc_oa_buffer(struct 
i915_perf_stream *stream)

 i915_gem_object_set_cache_coherency(bo, I915_CACHE_LLC);
 /* PreHSW required 512K alignment, HSW requires 16M */
-    vma = i915_gem_object_ggtt_pin(bo, NULL, 0, SZ_16M, 0);
+    vma = i915_vma_instance(bo, >ggtt->vm, NULL);
 if (IS_ERR(vma)) {
 ret = PTR_ERR(vma);
 goto err_unref;
 }
+
+    /*
+ * PreHSW required 512K alignment.
+ * HSW and onwards, align to requested size of OA buffer.
+ */
+    ret = i915_vma_pin(vma, 0, SZ_16M, PIN_GLOBAL | PIN_HIGH);
+    if (ret) {
+    drm_err(>i915->drm, "Failed to pin OA buffer %d\n", ret);
+    goto err_unref;
+    }
+
 stream->oa_buffer.vma = vma;
 stream->oa_buffer.vaddr =
@@ -1838,6 +1850,7 @@ static u32 *save_restore_register(struct 
i915_perf_stream *stream, u32 *cs,

 static int alloc_noa_wait(struct i915_perf_stream *stream)
 {
 struct drm_i915_private *i915 = stream->perf->i915;
+    struct intel_gt *gt = stream->engine->gt;
 struct drm_i915_gem_object *bo;
 struct i915_vma *vma;
 const u64 delay_ticks = 0x -
@@ -1878,12 +1891,16 @@ static int alloc_noa_wait(struct 
i915_perf_stream *stream)

  * multiple OA config BOs will have a jump to this address and it
  * needs to be fixed during the lifetime of the i915/perf stream.
  */
-    vma = i915_gem_object_ggtt_pin_ww(bo, , NULL, 0, 0, PIN_HIGH);
+    vma = i915_vma_instance(bo, >ggtt->vm, NULL);
 if (IS_ERR(vma)) {
 ret = PTR_ERR(vma);
 goto out_ww;
 }
+    ret = i915_vma_pin_ww(vma, , 0, 0, PIN_GLOBAL | PIN_HIGH);
+    if (ret)
+    goto out_ww;
+
 batch = cs = i915_gem_object_pin_map(bo, I915_MAP_WB);
 if (IS_ERR(batch)) {
 ret = PTR_ERR(batch);

Re: [Intel-gfx] [PATCH 02/19] drm/i915/perf: Add OA formats for DG2

2022-09-06 Thread Lionel Landwerlin


On 06/09/2022 22:46, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 10:35:16PM +0300, Lionel Landwerlin wrote:

On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Add new OA formats for DG2. Some of the newer OA formats are not
multples of 64 bytes and are not powers of 2. For those formats, adjust
hw_tail accordingly when checking for new reports.

Signed-off-by: Umesh Nerlige Ramappa 


Apart from the coding style issue :


Reviewed-by: Lionel Landwerlin 



---
 drivers/gpu/drm/i915/i915_perf.c | 63 
 include/uapi/drm/i915_drm.h  |  6 +++
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c

index 735244a3aedd..c8331b549d31 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -306,7 +306,8 @@ static u32 i915_oa_max_sample_rate = 10;
 /* XXX: beware if future OA HW adds new report formats that the 
current
  * code assumes all reports have a power-of-two size and ~(size - 
1) can

- * be used as a mask to align the OA tail pointer.
+ * be used as a mask to align the OA tail pointer. In some of the
+ * formats, R is used to denote reserved field.
  */
 static const struct i915_oa_format oa_formats[I915_OA_FORMAT_MAX] = {
 [I915_OA_FORMAT_A13]    = { 0, 64 },
@@ -320,6 +321,10 @@ static const struct i915_oa_format 
oa_formats[I915_OA_FORMAT_MAX] = {

 [I915_OA_FORMAT_A12]    = { 0, 64 },
 [I915_OA_FORMAT_A12_B8_C8]    = { 2, 128 },
 [I915_OA_FORMAT_A32u40_A4u32_B8_C8] = { 5, 256 },
+    [I915_OAR_FORMAT_A32u40_A4u32_B8_C8]    = { 5, 256 },
+    [I915_OA_FORMAT_A24u40_A14u32_B8_C8]    = { 5, 256 },
+    [I915_OAR_FORMAT_A36u64_B8_C8]    = { 1, 384 },
+    [I915_OA_FORMAT_A38u64_R2u64_B8_C8]    = { 1, 448 },
 };
 #define SAMPLE_OA_REPORT  (1<<0)
@@ -467,6 +472,7 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)

 bool pollin;
 u32 hw_tail;
 u64 now;
+    u32 partial_report_size;
 /* We have to consider the (unlikely) possibility that read() 
errors
  * could result in an OA buffer reset which might reset the 
head and
@@ -476,10 +482,16 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)

 hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
-    /* The tail pointer increases in 64 byte increments,
- * not in report_size steps...
+    /* The tail pointer increases in 64 byte increments, whereas 
report

+ * sizes need not be integral multiples or 64 or powers of 2.
+ * Compute potentially partially landed report in the OA buffer
  */
-    hw_tail &= ~(report_size - 1);
+    partial_report_size = OA_TAKEN(hw_tail, stream->oa_buffer.tail);
+    partial_report_size %= report_size;
+
+    /* Subtract partial amount off the tail */
+    hw_tail = gtt_offset + ((hw_tail - partial_report_size) &
+    (stream->oa_buffer.vma->size - 1));
 now = ktime_get_mono_fast_ns();
@@ -601,6 +613,8 @@ static int append_oa_sample(struct 
i915_perf_stream *stream,

 {
 int report_size = stream->oa_buffer.format_size;
 struct drm_i915_perf_record_header header;
+    int report_size_partial;
+    u8 *oa_buf_end;
 header.type = DRM_I915_PERF_RECORD_SAMPLE;
 header.pad = 0;
@@ -614,7 +628,19 @@ static int append_oa_sample(struct 
i915_perf_stream *stream,

 return -EFAULT;
 buf += sizeof(header);
-    if (copy_to_user(buf, report, report_size))
+    oa_buf_end = stream->oa_buffer.vaddr +
+ stream->oa_buffer.vma->size;
+    report_size_partial = oa_buf_end - report;
+
+    if (report_size_partial < report_size) {
+    if(copy_to_user(buf, report, report_size_partial))
+    return -EFAULT;
+    buf += report_size_partial;
+
+    if(copy_to_user(buf, stream->oa_buffer.vaddr,
+    report_size - report_size_partial))
+    return -EFAULT;


I think the coding style requires you to use if () not if()



Will fix.



Just a suggestion : you could make this code deal with the partial 
bit as the main bit of the function :



oa_buf_end = stream->oa_buffer.vaddr +
 stream->oa_buffer.vma->size;

report_size_partial = oa_buf_end - report;

if (copy_to_user(buf, report, report_size_partial))
return -EFAULT;
buf += report_size_partial;


This ^ may not work because append_oa_sample is appending exactly one 
report to the user buffer, whereas the above may append more than one.


Thanks,
Umesh



Ah I see, thanks for pointing this out.

-Lionel






if (report_size_partial < report_size &&
   copy_to_user(buf, stream->oa_buffer.vaddr,
    report_size - report_size_partial))
return -EFAULT;
buf += report_size - report_size_partial;



+    } else if (copy_to_user(buf, report, report_size))
 return -EFAULT;
 (*offset) += header.size;
@@ -684,8 +710,8 @@ static int

Re: [Intel-gfx] [PATCH 11/19] drm/i915/perf: Store a pointer to oa_format in oa_buffer

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

DG2 introduces OA reports with 64 bit report header fields. Perf OA
would need more information about the OA format in order to process such
reports. Store all OA format info in oa_buffer instead of just the size
and format-id.

Signed-off-by: Umesh Nerlige Ramappa 

Reviewed-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/i915_perf.c   | 23 ++-
  drivers/gpu/drm/i915/i915_perf_types.h |  3 +--
  2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index f7621b45966c..9e455bd3bce5 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -483,7 +483,7 @@ static u32 gen7_oa_hw_tail_read(struct i915_perf_stream 
*stream)
  static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
  {
u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
-   int report_size = stream->oa_buffer.format_size;
+   int report_size = stream->oa_buffer.format->size;
unsigned long flags;
bool pollin;
u32 hw_tail;
@@ -630,7 +630,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
size_t *offset,
const u8 *report)
  {
-   int report_size = stream->oa_buffer.format_size;
+   int report_size = stream->oa_buffer.format->size;
struct drm_i915_perf_record_header header;
int report_size_partial;
u8 *oa_buf_end;
@@ -694,7 +694,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
  size_t *offset)
  {
struct intel_uncore *uncore = stream->uncore;
-   int report_size = stream->oa_buffer.format_size;
+   int report_size = stream->oa_buffer.format->size;
u8 *oa_buf_base = stream->oa_buffer.vaddr;
u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
size_t start_offset = *offset;
@@ -970,7 +970,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream 
*stream,
  size_t *offset)
  {
struct intel_uncore *uncore = stream->uncore;
-   int report_size = stream->oa_buffer.format_size;
+   int report_size = stream->oa_buffer.format->size;
u8 *oa_buf_base = stream->oa_buffer.vaddr;
u32 gtt_offset = i915_ggtt_offset(stream->oa_buffer.vma);
u32 mask = (OA_BUFFER_SIZE - 1);
@@ -2517,7 +2517,7 @@ static int gen12_configure_oar_context(struct 
i915_perf_stream *stream,
  {
int err;
struct intel_context *ce = stream->pinned_ctx;
-   u32 format = stream->oa_buffer.format;
+   u32 format = stream->oa_buffer.format->format;
u32 offset = stream->perf->ctx_oactxctrl_offset;
struct flex regs_context[] = {
{
@@ -2890,7 +2890,7 @@ static void gen7_oa_enable(struct i915_perf_stream 
*stream)
u32 ctx_id = stream->specific_ctx_id;
bool periodic = stream->periodic;
u32 period_exponent = stream->period_exponent;
-   u32 report_format = stream->oa_buffer.format;
+   u32 report_format = stream->oa_buffer.format->format;
  
  	/*

 * Reset buf pointers so we don't forward reports from before now.
@@ -2916,7 +2916,7 @@ static void gen7_oa_enable(struct i915_perf_stream 
*stream)
  static void gen8_oa_enable(struct i915_perf_stream *stream)
  {
struct intel_uncore *uncore = stream->uncore;
-   u32 report_format = stream->oa_buffer.format;
+   u32 report_format = stream->oa_buffer.format->format;
  
  	/*

 * Reset buf pointers so we don't forward reports from before now.
@@ -2942,7 +2942,7 @@ static void gen8_oa_enable(struct i915_perf_stream 
*stream)
  static void gen12_oa_enable(struct i915_perf_stream *stream)
  {
struct intel_uncore *uncore = stream->uncore;
-   u32 report_format = stream->oa_buffer.format;
+   u32 report_format = stream->oa_buffer.format->format;
  
  	/*

 * If we don't want OA reports from the OA buffer, then we don't even
@@ -3184,15 +3184,12 @@ static int i915_oa_stream_init(struct i915_perf_stream 
*stream,
stream->sample_flags = props->sample_flags;
stream->sample_size += format_size;
  
-	stream->oa_buffer.format_size = format_size;

-   if (drm_WARN_ON(>drm, stream->oa_buffer.format_size == 0))
+   stream->oa_buffer.format = >oa_formats[props->oa_format];
+   if (drm_WARN_ON(>drm, stream->oa_buffer.format->size == 0))
return -EINVAL;
  
  	stream->hold_preemption = props->hold_preemption;
  
-	stream->oa_buffer.format =

-   perf->oa_formats[props->oa_format].format;
-
stream->periodic = props->oa_periodic;
if (stream->periodic)
st

Re: [Intel-gfx] [PATCH 10/19] drm/i915/perf: Use gt-specific ggtt for OA and noa-wait buffers

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

User passes uabi engine class and instance to the perf OA interface. Use
gt corresponding to the engine to pin the buffers to the right ggtt.

Signed-off-by: Umesh Nerlige Ramappa 


I didn't know there was a GGTT per engine.

Do I understand this correct?


Thanks,

-Lionel



---
  drivers/gpu/drm/i915/i915_perf.c | 21 +++--
  1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 87b92d2946f4..f7621b45966c 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1765,6 +1765,7 @@ static void gen12_init_oa_buffer(struct i915_perf_stream 
*stream)
  static int alloc_oa_buffer(struct i915_perf_stream *stream)
  {
struct drm_i915_private *i915 = stream->perf->i915;
+   struct intel_gt *gt = stream->engine->gt;
struct drm_i915_gem_object *bo;
struct i915_vma *vma;
int ret;
@@ -1784,11 +1785,22 @@ static int alloc_oa_buffer(struct i915_perf_stream 
*stream)
i915_gem_object_set_cache_coherency(bo, I915_CACHE_LLC);
  
  	/* PreHSW required 512K alignment, HSW requires 16M */

-   vma = i915_gem_object_ggtt_pin(bo, NULL, 0, SZ_16M, 0);
+   vma = i915_vma_instance(bo, >ggtt->vm, NULL);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
goto err_unref;
}
+
+   /*
+* PreHSW required 512K alignment.
+* HSW and onwards, align to requested size of OA buffer.
+*/
+   ret = i915_vma_pin(vma, 0, SZ_16M, PIN_GLOBAL | PIN_HIGH);
+   if (ret) {
+   drm_err(>i915->drm, "Failed to pin OA buffer %d\n", ret);
+   goto err_unref;
+   }
+
stream->oa_buffer.vma = vma;
  
  	stream->oa_buffer.vaddr =

@@ -1838,6 +1850,7 @@ static u32 *save_restore_register(struct i915_perf_stream 
*stream, u32 *cs,
  static int alloc_noa_wait(struct i915_perf_stream *stream)
  {
struct drm_i915_private *i915 = stream->perf->i915;
+   struct intel_gt *gt = stream->engine->gt;
struct drm_i915_gem_object *bo;
struct i915_vma *vma;
const u64 delay_ticks = 0x -
@@ -1878,12 +1891,16 @@ static int alloc_noa_wait(struct i915_perf_stream 
*stream)
 * multiple OA config BOs will have a jump to this address and it
 * needs to be fixed during the lifetime of the i915/perf stream.
 */
-   vma = i915_gem_object_ggtt_pin_ww(bo, , NULL, 0, 0, PIN_HIGH);
+   vma = i915_vma_instance(bo, >ggtt->vm, NULL);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
goto out_ww;
}
  
+	ret = i915_vma_pin_ww(vma, , 0, 0, PIN_GLOBAL | PIN_HIGH);

+   if (ret)
+   goto out_ww;
+
batch = cs = i915_gem_object_pin_map(bo, I915_MAP_WB);
if (IS_ERR(batch)) {
ret = PTR_ERR(batch);

Re: [Intel-gfx] [PATCH 08/19] drm/i915/perf: Move gt-specific data from i915->perf to gt->perf

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Make perf part of gt as the OAG buffer is specific to a gt. The refactor
eventually simplifies programming the right OA buffer and the right HW
registers when supporting multiple gts.

Signed-off-by: Umesh Nerlige Ramappa 

Reviewed-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/gt/intel_gt_types.h   |  3 +
  drivers/gpu/drm/i915/gt/intel_sseu.c   |  4 +-
  drivers/gpu/drm/i915/i915_perf.c   | 75 +-
  drivers/gpu/drm/i915/i915_perf_types.h | 39 +--
  drivers/gpu/drm/i915/selftests/i915_perf.c | 16 +++--
  5 files changed, 80 insertions(+), 57 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h 
b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 4d56f7d5a3be..3d079d206cec 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -20,6 +20,7 @@
  #include "intel_gsc.h"
  
  #include "i915_vma.h"

+#include "i915_perf_types.h"
  #include "intel_engine_types.h"
  #include "intel_gt_buffer_pool_types.h"
  #include "intel_hwconfig.h"
@@ -260,6 +261,8 @@ struct intel_gt {
/* sysfs defaults per gt */
struct gt_defaults defaults;
struct kobject *sysfs_defaults;
+
+   struct i915_perf_gt perf;
  };
  
  enum intel_gt_scratch_field {

diff --git a/drivers/gpu/drm/i915/gt/intel_sseu.c 
b/drivers/gpu/drm/i915/gt/intel_sseu.c
index c6d3050604c8..fcaf3c58b554 100644
--- a/drivers/gpu/drm/i915/gt/intel_sseu.c
+++ b/drivers/gpu/drm/i915/gt/intel_sseu.c
@@ -678,8 +678,8 @@ u32 intel_sseu_make_rpcs(struct intel_gt *gt,
 * If i915/perf is active, we want a stable powergating configuration
 * on the system. Use the configuration pinned by i915/perf.
 */
-   if (i915->perf.exclusive_stream)
-   req_sseu = >perf.sseu;
+   if (gt->perf.exclusive_stream)
+   req_sseu = >perf.sseu;
  
  	slices = hweight8(req_sseu->slice_mask);

subslices = hweight8(req_sseu->subslice_mask);
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3e3bda147c48..5dccb3c5 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1577,8 +1577,9 @@ free_noa_wait(struct i915_perf_stream *stream)
  static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
  {
struct i915_perf *perf = stream->perf;
+   struct intel_gt *gt = stream->engine->gt;
  
-	BUG_ON(stream != perf->exclusive_stream);

+   BUG_ON(stream != gt->perf.exclusive_stream);
  
  	/*

 * Unset exclusive_stream first, it will be checked while disabling
@@ -1586,7 +1587,7 @@ static void i915_oa_stream_destroy(struct 
i915_perf_stream *stream)
 *
 * See i915_oa_init_reg_state() and lrc_configure_all_contexts()
 */
-   WRITE_ONCE(perf->exclusive_stream, NULL);
+   WRITE_ONCE(gt->perf.exclusive_stream, NULL);
perf->ops.disable_metric_set(stream);
  
  	free_oa_buffer(stream);

@@ -2579,10 +2580,11 @@ oa_configure_all_contexts(struct i915_perf_stream 
*stream,
  {
struct drm_i915_private *i915 = stream->perf->i915;
struct intel_engine_cs *engine;
+   struct intel_gt *gt = stream->engine->gt;
struct i915_gem_context *ctx, *cn;
int err;
  
-	lockdep_assert_held(>perf->lock);

+   lockdep_assert_held(>perf.lock);
  
  	/*

 * The OA register config is setup through the context image. This image
@@ -3103,6 +3105,7 @@ static int i915_oa_stream_init(struct i915_perf_stream 
*stream,
  {
struct drm_i915_private *i915 = stream->perf->i915;
struct i915_perf *perf = stream->perf;
+   struct intel_gt *gt;
int format_size;
int ret;
  
@@ -3111,6 +3114,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,

"OA engine not specified\n");
return -EINVAL;
}
+   gt = props->engine->gt;
  
  	/*

 * If the sysfs metrics/ directory wasn't registered for some
@@ -3141,7 +3145,7 @@ static int i915_oa_stream_init(struct i915_perf_stream 
*stream,
 * counter reports and marshal to the appropriate client
 * we currently only allow exclusive access
 */
-   if (perf->exclusive_stream) {
+   if (gt->perf.exclusive_stream) {
drm_dbg(>perf->i915->drm,
"OA unit already in use\n");
return -EBUSY;
@@ -3221,8 +3225,8 @@ static int i915_oa_stream_init(struct i915_perf_stream 
*stream,
  
  	stream->ops = _oa_stream_ops;
  
-	perf->sseu = props->sseu;

-   WRITE_ONCE(perf->exclusive_stream, stream);
+   stream->engine->gt->perf.sseu = props->sseu;
+   WRITE_ONCE(gt->perf.exclusive_stream,

Re: [Intel-gfx] [PATCH 07/19] drm/i915/perf: Simply use stream->ctx

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Earlier code used exclusive_stream to check for user passed context.
Simplify this by accessing stream->ctx.

Signed-off-by: Umesh Nerlige Ramappa 

Reviewed-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/i915_perf.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index bbf1c574f393..3e3bda147c48 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -801,7 +801,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
 * switches since it's not-uncommon for periodic samples to
 * identify a switch before any 'context switch' report.
 */
-   if (!stream->perf->exclusive_stream->ctx ||
+   if (!stream->ctx ||
stream->specific_ctx_id == ctx_id ||
stream->oa_buffer.last_ctx_id == stream->specific_ctx_id ||
reason & OAREPORT_REASON_CTX_SWITCH) {
@@ -810,7 +810,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
 * While filtering for a single context we avoid
 * leaking the IDs of other contexts.
 */
-   if (stream->perf->exclusive_stream->ctx &&
+   if (stream->ctx &&
stream->specific_ctx_id != ctx_id) {
report32[2] = INVALID_CTX_ID;
}

Re: [Intel-gfx] [PATCH 05/19] drm/i915/perf: Enable commands per clock reporting in OA

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

XEHPSDV and DG2 provide a way to configure bytes per clock vs commands
per clock reporting. Enable command per clock setting on enabling OA.

Signed-off-by: Umesh Nerlige Ramappa 

Acked-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/i915_drv.h  |  3 +++
  drivers/gpu/drm/i915/i915_pci.c  |  1 +
  drivers/gpu/drm/i915/i915_perf.c | 20 
  drivers/gpu/drm/i915/i915_perf_oa_regs.h |  4 
  drivers/gpu/drm/i915/intel_device_info.h |  1 +
  5 files changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b4733c5a01da..b2e8a44bd976 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1287,6 +1287,9 @@ IS_SUBPLATFORM(const struct drm_i915_private *i915,
  #define HAS_RUNTIME_PM(dev_priv) (INTEL_INFO(dev_priv)->has_runtime_pm)
  #define HAS_64BIT_RELOC(dev_priv) (INTEL_INFO(dev_priv)->has_64bit_reloc)
  
+#define HAS_OA_BPC_REPORTING(dev_priv) \

+   (INTEL_INFO(dev_priv)->has_oa_bpc_reporting)
+
  /*
   * Set this flag, when platform requires 64K GTT page sizes or larger for
   * device local memory access.
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index d8446bb25d5e..bd0b8502b91e 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1019,6 +1019,7 @@ static const struct intel_device_info adl_p_info = {
.has_logical_ring_contexts = 1, \
.has_logical_ring_elsq = 1, \
.has_mslice_steering = 1, \
+   .has_oa_bpc_reporting = 1, \
.has_rc6 = 1, \
.has_reset_engine = 1, \
.has_rps = 1, \
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index efa7eda83edd..6fc4f0d8fc5a 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2745,10 +2745,12 @@ static int
  gen12_enable_metric_set(struct i915_perf_stream *stream,
struct i915_active *active)
  {
+   struct drm_i915_private *i915 = stream->perf->i915;
struct intel_uncore *uncore = stream->uncore;
struct i915_oa_config *oa_config = stream->oa_config;
bool periodic = stream->periodic;
u32 period_exponent = stream->period_exponent;
+   u32 sqcnt1;
int ret;
  
  	intel_uncore_write(uncore, GEN12_OAG_OA_DEBUG,

@@ -2767,6 +2769,16 @@ gen12_enable_metric_set(struct i915_perf_stream *stream,
(period_exponent << 
GEN12_OAG_OAGLBCTXCTRL_TIMER_PERIOD_SHIFT))
: 0);
  
+ 	/*

+* Initialize Super Queue Internal Cnt Register
+* Set PMON Enable in order to collect valid metrics.
+* Enable commands per clock reporting in OA for XEHPSDV onward.
+*/
+   sqcnt1 = GEN12_SQCNT1_PMON_ENABLE |
+(HAS_OA_BPC_REPORTING(i915) ? GEN12_SQCNT1_OABPC : 0);
+
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, 0, sqcnt1);
+
/*
 * Update all contexts prior writing the mux configurations as we need
 * to make sure all slices/subslices are ON before writing to NOA
@@ -2816,6 +2828,8 @@ static void gen11_disable_metric_set(struct 
i915_perf_stream *stream)
  static void gen12_disable_metric_set(struct i915_perf_stream *stream)
  {
struct intel_uncore *uncore = stream->uncore;
+   struct drm_i915_private *i915 = stream->perf->i915;
+   u32 sqcnt1;
  
  	/* Reset all contexts' slices/subslices configurations. */

gen12_configure_all_contexts(stream, NULL, NULL);
@@ -2826,6 +2840,12 @@ static void gen12_disable_metric_set(struct 
i915_perf_stream *stream)
  
  	/* Make sure we disable noa to save power. */

intel_uncore_rmw(uncore, RPM_CONFIG1, GEN10_GT_NOA_ENABLE, 0);
+
+   sqcnt1 = GEN12_SQCNT1_PMON_ENABLE |
+(HAS_OA_BPC_REPORTING(i915) ? GEN12_SQCNT1_OABPC : 0);
+
+   /* Reset PMON Enable to save power. */
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, sqcnt1, 0);
  }
  
  static void gen7_oa_enable(struct i915_perf_stream *stream)

diff --git a/drivers/gpu/drm/i915/i915_perf_oa_regs.h 
b/drivers/gpu/drm/i915/i915_perf_oa_regs.h
index 0ef3562ff4aa..381d94101610 100644
--- a/drivers/gpu/drm/i915/i915_perf_oa_regs.h
+++ b/drivers/gpu/drm/i915/i915_perf_oa_regs.h
@@ -134,4 +134,8 @@
  #define GDT_CHICKEN_BITS_MMIO(0x9840)
  #define   GT_NOA_ENABLE   0x0080
  
+#define GEN12_SQCNT1_MMIO(0x8718)

+#define   GEN12_SQCNT1_PMON_ENABLE REG_BIT(30)
+#define   GEN12_SQCNT1_OABPC   REG_BIT(29)
+
  #endif /* __INTEL_PERF_OA_REGS__ */
diff --git a/drivers/gpu/drm/i915/intel_device_info.h 
b/drivers/gpu/drm/i915/intel_device_info.h
index 23bf230aa104..fc2a0660426e 100644
--- a/drivers/gpu/drm/i915/intel_device_info.h
+++ b/drivers/gpu/drm/i915/intel_device_info.h
@@ -163,6 +163,7 @@ enum intel_ppgtt

Re: [Intel-gfx] [PATCH 04/19] drm/i915/perf: Determine gen12 oa ctx offset at runtime

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Some SKUs of same gen12 platform may have different oactxctrl
offsets. For gen12, determine oactxctrl offsets at runtime.

Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_perf.c | 149 ++-
  drivers/gpu/drm/i915/i915_perf_oa_regs.h |   2 +-
  2 files changed, 120 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3526693d64fa..efa7eda83edd 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1363,6 +1363,67 @@ static int gen12_get_render_context_id(struct 
i915_perf_stream *stream)
return 0;
  }
  
+#define MI_OPCODE(x) (((x) >> 23) & 0x3f)

+#define IS_MI_LRI_CMD(x) (MI_OPCODE(x) == MI_OPCODE(MI_INSTR(0x22, 0)))
+#define MI_LRI_LEN(x) (((x) & 0xff) + 1)



Maybe you want to put this in intel_gpu_commands.h



+#define __valid_oactxctrl_offset(x) ((x) && (x) != U32_MAX)
+static bool __find_reg_in_lri(u32 *state, u32 reg, u32 *offset)
+{
+   u32 idx = *offset;
+   u32 len = MI_LRI_LEN(state[idx]) + idx;
+
+   idx++;
+   for (; idx < len; idx += 2)
+   if (state[idx] == reg)
+   break;
+
+   *offset = idx;
+   return state[idx] == reg;
+}
+
+static u32 __context_image_offset(struct intel_context *ce, u32 reg)
+{
+   u32 offset, len = (ce->engine->context_size - PAGE_SIZE) / 4;
+   u32 *state = ce->lrc_reg_state;
+
+   for (offset = 0; offset < len; ) {
+   if (IS_MI_LRI_CMD(state[offset])) {


I'm a bit concerned you might find other matches with this.

Because let's say you run into a 3DSTATE_SUBSLICE_HASH_TABLE 
instruction, you'll iterate the instruction dword by dword because you 
don't know how to read its length and skip to the next one.


Now some of the fields can be programmed from userspace to look like an 
MI_LRI header, so you start to read data in the wrong way.



Unfortunately I don't have a better solution. My only ask is that you 
make __find_reg_in_lri() take the context image size in parameter so it 
NEVER goes over the the context image.



To limit the risk you should run this function only one at driver 
initialization and store the found offset.



Thanks,


-Lionel



+   if (__find_reg_in_lri(state, reg, ))
+   break;
+   } else {
+   offset++;
+   }
+   }
+
+   return offset < len ? offset : U32_MAX;
+}
+
+static int __set_oa_ctx_ctrl_offset(struct intel_context *ce)
+{
+   i915_reg_t reg = GEN12_OACTXCONTROL(ce->engine->mmio_base);
+   struct i915_perf *perf = >engine->i915->perf;
+   u32 saved_offset = perf->ctx_oactxctrl_offset;
+   u32 offset;
+
+   /* Do this only once. Failure is stored as offset of U32_MAX */
+   if (saved_offset)
+   return 0;
+
+   offset = __context_image_offset(ce, i915_mmio_reg_offset(reg));
+   perf->ctx_oactxctrl_offset = offset;
+
+   drm_dbg(>engine->i915->drm,
+   "%s oa ctx control at 0x%08x dword offset\n",
+   ce->engine->name, offset);
+
+   return __valid_oactxctrl_offset(offset) ? 0 : -ENODEV;
+}
+
+static bool engine_supports_mi_query(struct intel_engine_cs *engine)
+{
+   return engine->class == RENDER_CLASS;
+}
+
  /**
   * oa_get_render_ctx_id - determine and hold ctx hw id
   * @stream: An i915-perf stream opened for OA metrics
@@ -1382,6 +1443,17 @@ static int oa_get_render_ctx_id(struct i915_perf_stream 
*stream)
if (IS_ERR(ce))
return PTR_ERR(ce);
  
+	if (engine_supports_mi_query(stream->engine)) {

+   ret = __set_oa_ctx_ctrl_offset(ce);
+   if (ret && !(stream->sample_flags & SAMPLE_OA_REPORT)) {
+   intel_context_unpin(ce);
+   drm_err(>perf->i915->drm,
+   "Enabling perf query failed for %s\n",
+   stream->engine->name);
+   return ret;
+   }
+   }
+
switch (GRAPHICS_VER(ce->engine->i915)) {
case 7: {
/*
@@ -2412,10 +2484,11 @@ static int gen12_configure_oar_context(struct 
i915_perf_stream *stream,
int err;
struct intel_context *ce = stream->pinned_ctx;
u32 format = stream->oa_buffer.format;
+   u32 offset = stream->perf->ctx_oactxctrl_offset;
struct flex regs_context[] = {
{
GEN8_OACTXCONTROL,
-   stream->perf->ctx_oactxctrl_offset + 1,
+   offset + 1,
active ? GEN8_OA_COUNTER_RESUME : 0,
},
};
@@ -2440,15 +2513,18 @@ static int gen12_configure_oar_context(struct 
i915_perf_stream *stream,
},
};
  
-	/* Modify the context image of pinned context with regs_context*/

-

Re: [Intel-gfx] [PATCH 02/19] drm/i915/perf: Add OA formats for DG2

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

Add new OA formats for DG2. Some of the newer OA formats are not
multples of 64 bytes and are not powers of 2. For those formats, adjust
hw_tail accordingly when checking for new reports.

Signed-off-by: Umesh Nerlige Ramappa 


Apart from the coding style issue :


Reviewed-by: Lionel Landwerlin 



---
  drivers/gpu/drm/i915/i915_perf.c | 63 
  include/uapi/drm/i915_drm.h  |  6 +++
  2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 735244a3aedd..c8331b549d31 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -306,7 +306,8 @@ static u32 i915_oa_max_sample_rate = 10;
  
  /* XXX: beware if future OA HW adds new report formats that the current

   * code assumes all reports have a power-of-two size and ~(size - 1) can
- * be used as a mask to align the OA tail pointer.
+ * be used as a mask to align the OA tail pointer. In some of the
+ * formats, R is used to denote reserved field.
   */
  static const struct i915_oa_format oa_formats[I915_OA_FORMAT_MAX] = {
[I915_OA_FORMAT_A13]= { 0, 64 },
@@ -320,6 +321,10 @@ static const struct i915_oa_format 
oa_formats[I915_OA_FORMAT_MAX] = {
[I915_OA_FORMAT_A12]= { 0, 64 },
[I915_OA_FORMAT_A12_B8_C8]  = { 2, 128 },
[I915_OA_FORMAT_A32u40_A4u32_B8_C8] = { 5, 256 },
+   [I915_OAR_FORMAT_A32u40_A4u32_B8_C8]= { 5, 256 },
+   [I915_OA_FORMAT_A24u40_A14u32_B8_C8]= { 5, 256 },
+   [I915_OAR_FORMAT_A36u64_B8_C8]  = { 1, 384 },
+   [I915_OA_FORMAT_A38u64_R2u64_B8_C8] = { 1, 448 },
  };
  
  #define SAMPLE_OA_REPORT  (1<<0)

@@ -467,6 +472,7 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
bool pollin;
u32 hw_tail;
u64 now;
+   u32 partial_report_size;
  
  	/* We have to consider the (unlikely) possibility that read() errors

 * could result in an OA buffer reset which might reset the head and
@@ -476,10 +482,16 @@ static bool oa_buffer_check_unlocked(struct 
i915_perf_stream *stream)
  
  	hw_tail = stream->perf->ops.oa_hw_tail_read(stream);
  
-	/* The tail pointer increases in 64 byte increments,

-* not in report_size steps...
+   /* The tail pointer increases in 64 byte increments, whereas report
+* sizes need not be integral multiples or 64 or powers of 2.
+* Compute potentially partially landed report in the OA buffer
 */
-   hw_tail &= ~(report_size - 1);
+   partial_report_size = OA_TAKEN(hw_tail, stream->oa_buffer.tail);
+   partial_report_size %= report_size;
+
+   /* Subtract partial amount off the tail */
+   hw_tail = gtt_offset + ((hw_tail - partial_report_size) &
+   (stream->oa_buffer.vma->size - 1));
  
  	now = ktime_get_mono_fast_ns();
  
@@ -601,6 +613,8 @@ static int append_oa_sample(struct i915_perf_stream *stream,

  {
int report_size = stream->oa_buffer.format_size;
struct drm_i915_perf_record_header header;
+   int report_size_partial;
+   u8 *oa_buf_end;
  
  	header.type = DRM_I915_PERF_RECORD_SAMPLE;

header.pad = 0;
@@ -614,7 +628,19 @@ static int append_oa_sample(struct i915_perf_stream 
*stream,
return -EFAULT;
buf += sizeof(header);
  
-	if (copy_to_user(buf, report, report_size))

+   oa_buf_end = stream->oa_buffer.vaddr +
+stream->oa_buffer.vma->size;
+   report_size_partial = oa_buf_end - report;
+
+   if (report_size_partial < report_size) {
+   if(copy_to_user(buf, report, report_size_partial))
+   return -EFAULT;
+   buf += report_size_partial;
+
+   if(copy_to_user(buf, stream->oa_buffer.vaddr,
+   report_size - report_size_partial))
+   return -EFAULT;


I think the coding style requires you to use if () not if()


Just a suggestion : you could make this code deal with the partial bit 
as the main bit of the function :



oa_buf_end = stream->oa_buffer.vaddr +
 stream->oa_buffer.vma->size;

report_size_partial = oa_buf_end - report;

if (copy_to_user(buf, report, report_size_partial))
return -EFAULT;
buf += report_size_partial;

if (report_size_partial < report_size &&
copy_to_user(buf, stream->oa_buffer.vaddr,
report_size - report_size_partial))
return -EFAULT;
buf += report_size - report_size_partial;



+   } else if (copy_to_user(buf, report, report_size))
return -EFAULT;
  
  	(*offset) += header.size;

@@ -684,8 +710,8 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
 * all a power of two).
 *

Re: [Intel-gfx] [PATCH 01/19] drm/i915/perf: Fix OA filtering logic for GuC mode

2022-09-06 Thread Lionel Landwerlin


On 06/09/2022 20:39, Umesh Nerlige Ramappa wrote:

On Tue, Sep 06, 2022 at 05:33:00PM +0300, Lionel Landwerlin wrote:

On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:
With GuC mode of submission, GuC is in control of defining the 
context id field
that is part of the OA reports. To filter reports, UMD and KMD must 
know what sw
context id was chosen by GuC. There is not interface between KMD and 
GuC to
determine this, so read the upper-dword of EXECLIST_STATUS to 
filter/squash OA

reports for the specific context.

Signed-off-by: Umesh Nerlige Ramappa 



I assume you checked with GuC that this doesn't change as the context 
is running?


Correct.



With i915/execlist submission mode, we had to ask i915 to pin the 
sw_id/ctx_id.




From GuC perspective, the context id can change once KMD de-registers 
the context and that will not happen while the context is in use.


Thanks,
Umesh



Thanks Umesh,


Maybe I should have been more precise in my question :


Can the ID change while the i915-perf stream is opened?

Because the ID not changing while the context is running makes sense.

But since the number of available IDs is limited to 2k or something on 
Gfx12, it's possible the GuC has to reuse IDs if too many apps want to 
run during the period of time while i915-perf is active and filtering.



-Lionel






If that's not the case then filtering is broken.


-Lionel



---
 drivers/gpu/drm/i915/gt/intel_lrc.h |   2 +
 drivers/gpu/drm/i915/i915_perf.c    | 141 
 2 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h 
b/drivers/gpu/drm/i915/gt/intel_lrc.h

index a390f0813c8b..7111bae759f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.h
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
@@ -110,6 +110,8 @@ enum {
 #define XEHP_SW_CTX_ID_WIDTH    16
 #define XEHP_SW_COUNTER_SHIFT    58
 #define XEHP_SW_COUNTER_WIDTH    6
+#define GEN12_GUC_SW_CTX_ID_SHIFT    39
+#define GEN12_GUC_SW_CTX_ID_WIDTH    16
 static inline void lrc_runtime_start(struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c

index f3c23fe9ad9c..735244a3aedd 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1233,6 +1233,125 @@ static struct intel_context 
*oa_pin_context(struct i915_perf_stream *stream)

 return stream->pinned_ctx;
 }
+static int
+__store_reg_to_mem(struct i915_request *rq, i915_reg_t reg, u32 
ggtt_offset)

+{
+    u32 *cs, cmd;
+
+    cmd = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+    if (GRAPHICS_VER(rq->engine->i915) >= 8)
+    cmd++;
+
+    cs = intel_ring_begin(rq, 4);
+    if (IS_ERR(cs))
+    return PTR_ERR(cs);
+
+    *cs++ = cmd;
+    *cs++ = i915_mmio_reg_offset(reg);
+    *cs++ = ggtt_offset;
+    *cs++ = 0;
+
+    intel_ring_advance(rq, cs);
+
+    return 0;
+}
+
+static int
+__read_reg(struct intel_context *ce, i915_reg_t reg, u32 ggtt_offset)
+{
+    struct i915_request *rq;
+    int err;
+
+    rq = i915_request_create(ce);
+    if (IS_ERR(rq))
+    return PTR_ERR(rq);
+
+    i915_request_get(rq);
+
+    err = __store_reg_to_mem(rq, reg, ggtt_offset);
+
+    i915_request_add(rq);
+    if (!err && i915_request_wait(rq, 0, HZ / 2) < 0)
+    err = -ETIME;
+
+    i915_request_put(rq);
+
+    return err;
+}
+
+static int
+gen12_guc_sw_ctx_id(struct intel_context *ce, u32 *ctx_id)
+{
+    struct i915_vma *scratch;
+    u32 *val;
+    int err;
+
+    scratch = 
__vm_create_scratch_for_read_pinned(>engine->gt->ggtt->vm, 4);

+    if (IS_ERR(scratch))
+    return PTR_ERR(scratch);
+
+    err = i915_vma_sync(scratch);
+    if (err)
+    goto err_scratch;
+
+    err = __read_reg(ce, 
RING_EXECLIST_STATUS_HI(ce->engine->mmio_base),

+ i915_ggtt_offset(scratch));
+    if (err)
+    goto err_scratch;
+
+    val = i915_gem_object_pin_map_unlocked(scratch->obj, I915_MAP_WB);
+    if (IS_ERR(val)) {
+    err = PTR_ERR(val);
+    goto err_scratch;
+    }
+
+    *ctx_id = *val;
+    i915_gem_object_unpin_map(scratch->obj);
+
+err_scratch:
+    i915_vma_unpin_and_release(, 0);
+    return err;
+}
+
+/*
+ * For execlist mode of submission, pick an unused context id
+ * 0 - (NUM_CONTEXT_TAG -1) are used by other contexts
+ * XXX_MAX_CONTEXT_HW_ID is used by idle context
+ *
+ * For GuC mode of submission read context id from the upper dword 
of the

+ * EXECLIST_STATUS register.
+ */
+static int gen12_get_render_context_id(struct i915_perf_stream 
*stream)

+{
+    u32 ctx_id, mask;
+    int ret;
+
+    if (intel_engine_uses_guc(stream->engine)) {
+    ret = gen12_guc_sw_ctx_id(stream->pinned_ctx, _id);
+    if (ret)
+    return ret;
+
+    mask = ((1U << GEN12_GUC_SW_CTX_ID_WIDTH) - 1) <<
+    (GEN12_GUC_SW_CTX_ID_SHIFT - 32);
+    } else if (GRAPHICS_VER_FULL(stream->engine->

Re: [Intel-gfx] [PATCH 01/19] drm/i915/perf: Fix OA filtering logic for GuC mode

2022-09-06 Thread Lionel Landwerlin


On 23/08/2022 23:41, Umesh Nerlige Ramappa wrote:

With GuC mode of submission, GuC is in control of defining the context id field
that is part of the OA reports. To filter reports, UMD and KMD must know what sw
context id was chosen by GuC. There is not interface between KMD and GuC to
determine this, so read the upper-dword of EXECLIST_STATUS to filter/squash OA
reports for the specific context.

Signed-off-by: Umesh Nerlige Ramappa 



I assume you checked with GuC that this doesn't change as the context is 
running?


With i915/execlist submission mode, we had to ask i915 to pin the 
sw_id/ctx_id.



If that's not the case then filtering is broken.


-Lionel



---
  drivers/gpu/drm/i915/gt/intel_lrc.h |   2 +
  drivers/gpu/drm/i915/i915_perf.c| 141 
  2 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.h 
b/drivers/gpu/drm/i915/gt/intel_lrc.h
index a390f0813c8b..7111bae759f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.h
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.h
@@ -110,6 +110,8 @@ enum {
  #define XEHP_SW_CTX_ID_WIDTH  16
  #define XEHP_SW_COUNTER_SHIFT 58
  #define XEHP_SW_COUNTER_WIDTH 6
+#define GEN12_GUC_SW_CTX_ID_SHIFT  39
+#define GEN12_GUC_SW_CTX_ID_WIDTH  16
  
  static inline void lrc_runtime_start(struct intel_context *ce)

  {
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index f3c23fe9ad9c..735244a3aedd 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1233,6 +1233,125 @@ static struct intel_context *oa_pin_context(struct 
i915_perf_stream *stream)
return stream->pinned_ctx;
  }
  
+static int

+__store_reg_to_mem(struct i915_request *rq, i915_reg_t reg, u32 ggtt_offset)
+{
+   u32 *cs, cmd;
+
+   cmd = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+   if (GRAPHICS_VER(rq->engine->i915) >= 8)
+   cmd++;
+
+   cs = intel_ring_begin(rq, 4);
+   if (IS_ERR(cs))
+   return PTR_ERR(cs);
+
+   *cs++ = cmd;
+   *cs++ = i915_mmio_reg_offset(reg);
+   *cs++ = ggtt_offset;
+   *cs++ = 0;
+
+   intel_ring_advance(rq, cs);
+
+   return 0;
+}
+
+static int
+__read_reg(struct intel_context *ce, i915_reg_t reg, u32 ggtt_offset)
+{
+   struct i915_request *rq;
+   int err;
+
+   rq = i915_request_create(ce);
+   if (IS_ERR(rq))
+   return PTR_ERR(rq);
+
+   i915_request_get(rq);
+
+   err = __store_reg_to_mem(rq, reg, ggtt_offset);
+
+   i915_request_add(rq);
+   if (!err && i915_request_wait(rq, 0, HZ / 2) < 0)
+   err = -ETIME;
+
+   i915_request_put(rq);
+
+   return err;
+}
+
+static int
+gen12_guc_sw_ctx_id(struct intel_context *ce, u32 *ctx_id)
+{
+   struct i915_vma *scratch;
+   u32 *val;
+   int err;
+
+   scratch = 
__vm_create_scratch_for_read_pinned(>engine->gt->ggtt->vm, 4);
+   if (IS_ERR(scratch))
+   return PTR_ERR(scratch);
+
+   err = i915_vma_sync(scratch);
+   if (err)
+   goto err_scratch;
+
+   err = __read_reg(ce, RING_EXECLIST_STATUS_HI(ce->engine->mmio_base),
+i915_ggtt_offset(scratch));
+   if (err)
+   goto err_scratch;
+
+   val = i915_gem_object_pin_map_unlocked(scratch->obj, I915_MAP_WB);
+   if (IS_ERR(val)) {
+   err = PTR_ERR(val);
+   goto err_scratch;
+   }
+
+   *ctx_id = *val;
+   i915_gem_object_unpin_map(scratch->obj);
+
+err_scratch:
+   i915_vma_unpin_and_release(, 0);
+   return err;
+}
+
+/*
+ * For execlist mode of submission, pick an unused context id
+ * 0 - (NUM_CONTEXT_TAG -1) are used by other contexts
+ * XXX_MAX_CONTEXT_HW_ID is used by idle context
+ *
+ * For GuC mode of submission read context id from the upper dword of the
+ * EXECLIST_STATUS register.
+ */
+static int gen12_get_render_context_id(struct i915_perf_stream *stream)
+{
+   u32 ctx_id, mask;
+   int ret;
+
+   if (intel_engine_uses_guc(stream->engine)) {
+   ret = gen12_guc_sw_ctx_id(stream->pinned_ctx, _id);
+   if (ret)
+   return ret;
+
+   mask = ((1U << GEN12_GUC_SW_CTX_ID_WIDTH) - 1) <<
+   (GEN12_GUC_SW_CTX_ID_SHIFT - 32);
+   } else if (GRAPHICS_VER_FULL(stream->engine->i915) >= IP_VER(12, 50)) {
+   ctx_id = (XEHP_MAX_CONTEXT_HW_ID - 1) <<
+   (XEHP_SW_CTX_ID_SHIFT - 32);
+
+   mask = ((1U << XEHP_SW_CTX_ID_WIDTH) - 1) <<
+   (XEHP_SW_CTX_ID_SHIFT - 32);
+   } else {
+   ctx_id = (GEN12_MAX_CONTEXT_HW_ID - 1) <<
+(GEN11_SW_CTX_ID_SHIFT - 32);
+
+   mask = ((1U << GEN11_SW_CTX_ID_WIDTH) - 1) <<
+   (GEN11_SW_CTX_ID_SHIFT - 32);
+   }
+

Re: [Intel-gfx] [PATCH v3] drm/i915/dg2: Add performance workaround 18019455067

2022-07-20 Thread Lionel Landwerlin


Ping?

On 11/07/2022 14:30, Lionel Landwerlin wrote:

Ping?

On 30/06/2022 11:35, Lionel Landwerlin wrote:

The recommended number of stackIDs for Ray Tracing subsystem is 512
rather than 2048 (default HW programming).

v2: Move the programming to dg2_ctx_gt_tuning_init() (Lucas)

v3: Move programming to general_render_compute_wa_init() (Matt)

Signed-off-by: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h | 4 
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 9 +
  2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h

index 07ef111947b8c..12fc87b957425 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,10 @@
  #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS    REG_BIT(8)
    #define RT_CTRL    _MMIO(0xe530)
+#define   RT_CTRL_NUMBER_OF_STACKIDS_MASK    REG_GENMASK(6, 5)
+#define   NUMBER_OF_STACKIDS_512    2
+#define   NUMBER_OF_STACKIDS_1024    1
+#define   NUMBER_OF_STACKIDS_2048    0
  #define   DIS_NULL_QUERY    REG_BIT(10)
    #define EU_PERF_CNTL1    _MMIO(0xe558)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c

index 3213c593a55f4..ea674e456cd76 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2737,6 +2737,15 @@ general_render_compute_wa_init(struct 
intel_engine_cs *engine, struct i915_wa_li

  wa_write_or(wal, VDBX_MOD_CTRL, FORCE_MISS_FTLB);
  wa_write_or(wal, VEBX_MOD_CTRL, FORCE_MISS_FTLB);
  }
+
+    if (IS_DG2(i915)) {
+    /* Performance tuning for Ray-tracing */
+    wa_write_clr_set(wal,
+ RT_CTRL,
+ RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+ REG_FIELD_PREP(RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+    NUMBER_OF_STACKIDS_512));
+    }
  }
    static void

Re: [Intel-gfx] [PATCH v3] drm/i915/dg2: Add performance workaround 18019455067

2022-07-11 Thread Lionel Landwerlin


Ping?

On 30/06/2022 11:35, Lionel Landwerlin wrote:

The recommended number of stackIDs for Ray Tracing subsystem is 512
rather than 2048 (default HW programming).

v2: Move the programming to dg2_ctx_gt_tuning_init() (Lucas)

v3: Move programming to general_render_compute_wa_init() (Matt)

Signed-off-by: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h | 4 
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 9 +
  2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 07ef111947b8c..12fc87b957425 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,10 @@
  #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS REG_BIT(8)
  
  #define RT_CTRL	_MMIO(0xe530)

+#define   RT_CTRL_NUMBER_OF_STACKIDS_MASK  REG_GENMASK(6, 5)
+#define   NUMBER_OF_STACKIDS_512   2
+#define   NUMBER_OF_STACKIDS_1024  1
+#define   NUMBER_OF_STACKIDS_2048  0
  #define   DIS_NULL_QUERY  REG_BIT(10)
  
  #define EU_PERF_CNTL1_MMIO(0xe558)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 3213c593a55f4..ea674e456cd76 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2737,6 +2737,15 @@ general_render_compute_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_li
wa_write_or(wal, VDBX_MOD_CTRL, FORCE_MISS_FTLB);
wa_write_or(wal, VEBX_MOD_CTRL, FORCE_MISS_FTLB);
}
+
+   if (IS_DG2(i915)) {
+   /* Performance tuning for Ray-tracing */
+   wa_write_clr_set(wal,
+RT_CTRL,
+RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+REG_FIELD_PREP(RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+   NUMBER_OF_STACKIDS_512));
+   }
  }
  
  static void

Re: [Intel-gfx] [PATCH 2/2] i915/perf: Disable OA sseu config param for gfx12.50+

2022-07-08 Thread Lionel Landwerlin


On 07/07/2022 22:30, Nerlige Ramappa, Umesh wrote:

The global sseu config is applicable only to gen11 platforms where
concurrent media, render and OA use cases may cause some subslices to be
turned off and hence lose NOA configuration. Ideally we want to return
ENODEV for non-gen11 platforms, however, this has shipped with gfx12, so
disable only for gfx12.50+.

v2: gfx12 is already shipped with this, disable for gfx12.50+ (Lionel)

v3: (Matt)
- Update commit message and replace "12.5" with "12.50"
- Replace DRM_DEBUG() with driver specific drm_dbg()

Signed-off-by: Umesh Nerlige Ramappa 

Acked-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/i915_perf.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index b3beb89884e0..f3c23fe9ad9c 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -3731,6 +3731,13 @@ static int read_properties_unlocked(struct i915_perf 
*perf,
case DRM_I915_PERF_PROP_GLOBAL_SSEU: {
struct drm_i915_gem_context_param_sseu user_sseu;
  
+			if (GRAPHICS_VER_FULL(perf->i915) >= IP_VER(12, 50)) {

+   drm_dbg(>i915->drm,
+   "SSEU config not supported on gfx %x\n",
+   GRAPHICS_VER_FULL(perf->i915));
+   return -ENODEV;
+   }
+
if (copy_from_user(_sseu,
   u64_to_user_ptr(value),
   sizeof(user_sseu))) {

Re: [Intel-gfx] [PATCH 1/2] i915/perf: Replace DRM_DEBUG with driver specific drm_dbg call

2022-07-08 Thread Lionel Landwerlin


On 07/07/2022 22:30, Nerlige Ramappa, Umesh wrote:

DRM_DEBUG is not the right debug call to use in i915 OA, replace it with
driver specific drm_dbg() call (Matt).

Signed-off-by: Umesh Nerlige Ramappa 

Acked-by: Lionel Landwerlin 

---
  drivers/gpu/drm/i915/i915_perf.c | 151 ---
  1 file changed, 100 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 1577ab6754db..b3beb89884e0 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -885,8 +885,9 @@ static int gen8_oa_read(struct i915_perf_stream *stream,
if (ret)
return ret;
  
-		DRM_DEBUG("OA buffer overflow (exponent = %d): force restart\n",

- stream->period_exponent);
+   drm_dbg(>perf->i915->drm,
+   "OA buffer overflow (exponent = %d): force restart\n",
+   stream->period_exponent);
  
  		stream->perf->ops.oa_disable(stream);

stream->perf->ops.oa_enable(stream);
@@ -1108,8 +1109,9 @@ static int gen7_oa_read(struct i915_perf_stream *stream,
if (ret)
return ret;
  
-		DRM_DEBUG("OA buffer overflow (exponent = %d): force restart\n",

- stream->period_exponent);
+   drm_dbg(>perf->i915->drm,
+   "OA buffer overflow (exponent = %d): force restart\n",
+   stream->period_exponent);
  
  		stream->perf->ops.oa_disable(stream);

stream->perf->ops.oa_enable(stream);
@@ -2863,7 +2865,8 @@ static int i915_oa_stream_init(struct i915_perf_stream 
*stream,
int ret;
  
  	if (!props->engine) {

-   DRM_DEBUG("OA engine not specified\n");
+   drm_dbg(>perf->i915->drm,
+   "OA engine not specified\n");
return -EINVAL;
}
  
@@ -2873,18 +2876,21 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,

 * IDs
 */
if (!perf->metrics_kobj) {
-   DRM_DEBUG("OA metrics weren't advertised via sysfs\n");
+   drm_dbg(>perf->i915->drm,
+   "OA metrics weren't advertised via sysfs\n");
return -EINVAL;
}
  
  	if (!(props->sample_flags & SAMPLE_OA_REPORT) &&

(GRAPHICS_VER(perf->i915) < 12 || !stream->ctx)) {
-   DRM_DEBUG("Only OA report sampling supported\n");
+   drm_dbg(>perf->i915->drm,
+   "Only OA report sampling supported\n");
return -EINVAL;
}
  
  	if (!perf->ops.enable_metric_set) {

-   DRM_DEBUG("OA unit not supported\n");
+   drm_dbg(>perf->i915->drm,
+   "OA unit not supported\n");
return -ENODEV;
}
  
@@ -2894,12 +2900,14 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,

 * we currently only allow exclusive access
 */
if (perf->exclusive_stream) {
-   DRM_DEBUG("OA unit already in use\n");
+   drm_dbg(>perf->i915->drm,
+   "OA unit already in use\n");
return -EBUSY;
}
  
  	if (!props->oa_format) {

-   DRM_DEBUG("OA report format not specified\n");
+   drm_dbg(>perf->i915->drm,
+   "OA report format not specified\n");
return -EINVAL;
}
  
@@ -2929,20 +2937,23 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,

if (stream->ctx) {
ret = oa_get_render_ctx_id(stream);
if (ret) {
-   DRM_DEBUG("Invalid context id to filter with\n");
+   drm_dbg(>perf->i915->drm,
+   "Invalid context id to filter with\n");
return ret;
}
}
  
  	ret = alloc_noa_wait(stream);

if (ret) {
-   DRM_DEBUG("Unable to allocate NOA wait batch buffer\n");
+   drm_dbg(>perf->i915->drm,
+   "Unable to allocate NOA wait batch buffer\n");
goto err_noa_wait_alloc;
}
  
  	stream->oa_config = i915_perf_get_oa_config(perf, props->metrics_set);

if (!stream->oa_config) {
-   DRM_DEBUG("Invalid OA config id=%i\n", props->metrics_set);
+   drm_dbg(>perf->i915->drm,
+   "Invalid OA config id=%i\n", props->metrics_set);

Re: [Intel-gfx] [PATCH] i915/perf: Disable OA sseu config param for non-gen11 platforms

2022-07-07 Thread Lionel Landwerlin


On 07/07/2022 00:52, Nerlige Ramappa, Umesh wrote:

The global sseu config is applicable only to gen11 platforms where
concurrent media, render and OA use cases may cause some subslices to be
turned off and hence lose NOA configuration. Return ENODEV for non-gen11
platforms.

Signed-off-by: Umesh Nerlige Ramappa 



The problem is that we have gfx12 platforms shipped using this.

So I guess have to disable it on gfx12.5+ where it was never accepted.


-Lionel



---
  drivers/gpu/drm/i915/i915_perf.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 1577ab6754db..512c163fdbeb 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -3706,6 +3706,12 @@ static int read_properties_unlocked(struct i915_perf 
*perf,
case DRM_I915_PERF_PROP_GLOBAL_SSEU: {
struct drm_i915_gem_context_param_sseu user_sseu;
  
+			if (GRAPHICS_VER(perf->i915) != 11) {

+   DRM_DEBUG("Global SSEU config not supported on 
gen%d\n",
+ GRAPHICS_VER(perf->i915));
+   return -ENODEV;
+   }
+
if (copy_from_user(_sseu,
   u64_to_user_ptr(value),
   sizeof(user_sseu))) {

Re: [Intel-gfx] [PATCH v6 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-07-04 Thread Lionel Landwerlin


On 30/06/2022 20:12, Zanoni, Paulo R wrote:

Can you please explain what happens when we try to write to a range
that's bound as read-only?


It will be mapped as read-only in device page table. Hence any
write access will fail. I would expect a CAT error reported.

What's a CAT error? Does this lead to machine freeze or a GPU hang?
Let's make sure we document this.


Catastrophic error.

Reading the documentation, it seems the behavior depends on the context 
type.


With the Legacy 64bit context type, writes are ignored (BSpec 531) :

    - "For legacy context, the access rights are not applicable and 
should not be considered during page walk."


For Advanced 64bit context type, I think the HW will generate a pagefault.


-Lionel

Re: [Intel-gfx] [PATCH v2] drm/i915/dg2: Add performance workaround 18019455067

2022-06-30 Thread Lionel Landwerlin


On 30/06/2022 01:16, Matt Roper wrote:

On Mon, Jun 27, 2022 at 03:59:28PM +0300, Lionel Landwerlin wrote:

The recommended number of stackIDs for Ray Tracing subsystem is 512
rather than 2048 (default HW programming).

v2: Move the programming to dg2_ctx_gt_tuning_init() (Lucas)

I'm not sure this is actually the correct move.  As far as I can see on
bspec 46261, RT_CTRL isn't part of the engine's context, so we need to
make sure it gets added to engine->wa_list instead of
engine->ctx_wa_list, otherwise it won't be properly re-applied after
engine resets and such.  Most of our other tuning values are part of the
context image, so this one is a bit unusual.

To get it onto the engine->wa_list, the workaround needs to either be
defined via rcs_engine_wa_init() or general_render_compute_wa_init().
The latter is the new, preferred location for registers that are part of
the render/compute reset domain, but that don't live in the RCS engine's
0x2xxx MMIO range (since all RCS and CCS engines get reset together, the
items in general_render_compute_wa_init() will make sure it's dealt with
as part of the handling for the first RCS/CCS engine, so that we won't
miss out on applying it if the platform doesn't have an RCS).

At the moment we don't have too many "tuning" values that we need to set
that aren't part of an engine's context, so we don't yet have a
dedicated "tuning" function for engine-style workarounds like we do with
ctx-style workarounds.


Matt



Thanks Matt,


I didn't pay attention to the register offset and that it's not 
context/engine specific.


Moving it to general_render_compute_wa_init()


-Lionel





Signed-off-by: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gt/intel_gt_regs.h | 4 
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +
  2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 07ef111947b8c..12fc87b957425 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,10 @@
  #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS REG_BIT(8)
  
  #define RT_CTRL	_MMIO(0xe530)

+#define   RT_CTRL_NUMBER_OF_STACKIDS_MASK  REG_GENMASK(6, 5)
+#define   NUMBER_OF_STACKIDS_512   2
+#define   NUMBER_OF_STACKIDS_1024  1
+#define   NUMBER_OF_STACKIDS_2048  0
  #define   DIS_NULL_QUERY  REG_BIT(10)
  
  #define EU_PERF_CNTL1_MMIO(0xe558)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 3213c593a55f4..4d80716b957d4 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -575,6 +575,11 @@ static void dg2_ctx_gt_tuning_init(struct intel_engine_cs 
*engine,
   FF_MODE2_TDS_TIMER_MASK,
   FF_MODE2_TDS_TIMER_128,
   0, false);
+   wa_write_clr_set(wal,
+RT_CTRL,
+RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+REG_FIELD_PREP(RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+   NUMBER_OF_STACKIDS_512));
  }
  
  /*

--
2.34.1

[Intel-gfx] [PATCH v3] drm/i915/dg2: Add performance workaround 18019455067

2022-06-30 Thread Lionel Landwerlin

The recommended number of stackIDs for Ray Tracing subsystem is 512
rather than 2048 (default HW programming).

v2: Move the programming to dg2_ctx_gt_tuning_init() (Lucas)

v3: Move programming to general_render_compute_wa_init() (Matt)

Signed-off-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h | 4 
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 9 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 07ef111947b8c..12fc87b957425 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,10 @@
 #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS  REG_BIT(8)
 
 #define RT_CTRL_MMIO(0xe530)
+#define   RT_CTRL_NUMBER_OF_STACKIDS_MASK  REG_GENMASK(6, 5)
+#define   NUMBER_OF_STACKIDS_512   2
+#define   NUMBER_OF_STACKIDS_1024  1
+#define   NUMBER_OF_STACKIDS_2048  0
 #define   DIS_NULL_QUERY   REG_BIT(10)
 
 #define EU_PERF_CNTL1  _MMIO(0xe558)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 3213c593a55f4..ea674e456cd76 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2737,6 +2737,15 @@ general_render_compute_wa_init(struct intel_engine_cs 
*engine, struct i915_wa_li
wa_write_or(wal, VDBX_MOD_CTRL, FORCE_MISS_FTLB);
wa_write_or(wal, VEBX_MOD_CTRL, FORCE_MISS_FTLB);
}
+
+   if (IS_DG2(i915)) {
+   /* Performance tuning for Ray-tracing */
+   wa_write_clr_set(wal,
+RT_CTRL,
+RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+REG_FIELD_PREP(RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+   NUMBER_OF_STACKIDS_512));
+   }
 }
 
 static void
-- 
2.34.1

[Intel-gfx] [PATCH v2] drm/i915/dg2: Add performance workaround 18019455067

2022-06-27 Thread Lionel Landwerlin

The recommended number of stackIDs for Ray Tracing subsystem is 512
rather than 2048 (default HW programming).

v2: Move the programming to dg2_ctx_gt_tuning_init() (Lucas)

Signed-off-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h | 4 
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 07ef111947b8c..12fc87b957425 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,10 @@
 #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS  REG_BIT(8)
 
 #define RT_CTRL_MMIO(0xe530)
+#define   RT_CTRL_NUMBER_OF_STACKIDS_MASK  REG_GENMASK(6, 5)
+#define   NUMBER_OF_STACKIDS_512   2
+#define   NUMBER_OF_STACKIDS_1024  1
+#define   NUMBER_OF_STACKIDS_2048  0
 #define   DIS_NULL_QUERY   REG_BIT(10)
 
 #define EU_PERF_CNTL1  _MMIO(0xe558)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 3213c593a55f4..4d80716b957d4 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -575,6 +575,11 @@ static void dg2_ctx_gt_tuning_init(struct intel_engine_cs 
*engine,
   FF_MODE2_TDS_TIMER_MASK,
   FF_MODE2_TDS_TIMER_128,
   0, false);
+   wa_write_clr_set(wal,
+RT_CTRL,
+RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+REG_FIELD_PREP(RT_CTRL_NUMBER_OF_STACKIDS_MASK,
+   NUMBER_OF_STACKIDS_512));
 }
 
 /*
-- 
2.34.1

Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-23 Thread Lionel Landwerlin


On 23/06/2022 14:05, Tvrtko Ursulin wrote:


On 23/06/2022 09:57, Lionel Landwerlin wrote:

On 23/06/2022 11:27, Tvrtko Ursulin wrote:


After a vm_unbind, UMD can re-bind to same VA range against an 
active VM.
Though I am not sue with Mesa usecase if that new mapping is 
required for
running GPU job or it will be for the next submission. But ensuring 
the

tlb flush upon unbind, KMD can ensure correctness.


Isn't that their problem? If they re-bind for submitting _new_ work 
then they get the flush as part of batch buffer pre-amble. 


In the non sparse case, if a VA range is unbound, it is invalid to 
use that range for anything until it has been rebound by something else.


We'll take the fence provided by vm_bind and put it as a wait fence 
on the next execbuffer.


It might be safer in case of memory over fetching?


TLB flush will have to happen at some point right?

What's the alternative to do it in unbind?


Currently TLB flush happens from the ring before every BB_START and 
also when i915 returns the backing store pages to the system.


For the former, I haven't seen any mention that for execbuf3 there are 
plans to stop doing it? Anyway, as long as this is kept and sequence 
of bind[1..N]+execbuf is safe and correctly sees all the preceding binds.
Hence about the alternative to doing it in unbind - first I think lets 
state the problem that is trying to solve.


For instance is it just for the compute "append work to the running 
batch" use case? I honestly don't remember how was that supposed to 
work so maybe the tlb flush on bind was supposed to deal with that 
scenario?


Or you see a problem even for Mesa with the current model?

Regards,

Tvrtko



As far as I can tell, all the binds should have completed before execbuf 
starts if you follow the vulkan sparse binding rules.


For non-sparse, the UMD will take care of it.

I think we're fine.


-Lionel

Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-23 Thread Lionel Landwerlin


On 22/06/2022 18:12, Niranjana Vishwanathapura wrote:

On Wed, Jun 22, 2022 at 09:10:07AM +0100, Tvrtko Ursulin wrote:


On 22/06/2022 04:56, Niranjana Vishwanathapura wrote:

VM_BIND and related uapi definitions

v2: Reduce the scope to simple Mesa use case.
v3: Expand VM_UNBIND documentation and add
    I915_GEM_VM_BIND/UNBIND_FENCE_VALID
    and I915_GEM_VM_BIND_TLB_FLUSH flags.

Signed-off-by: Niranjana Vishwanathapura 


---
 Documentation/gpu/rfc/i915_vm_bind.h | 243 +++
 1 file changed, 243 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

new file mode 100644
index ..fa23b2d7ec6f
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,243 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND    57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * The older execbuf2 ioctl will not support VM_BIND mode of 
operation.
+ * For VM_BIND mode, we have new execbuf3 ioctl which will not 
accept any

+ * execlist (See struct drm_i915_gem_execbuffer3 for more details).
+ *
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND    0x3d
+#define DRM_I915_GEM_VM_UNBIND    0x3e
+#define DRM_I915_GEM_EXECBUFFER3    0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_EXECBUFFER3 DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_EXECBUFFER3, struct drm_i915_gem_execbuffer3)

+
+/**
+ * struct drm_i915_gem_vm_bind_fence - Bind/unbind completion 
notification.

+ *
+ * A timeline out fence for vm_bind/unbind completion notification.
+ */
+struct drm_i915_gem_vm_bind_fence {
+    /** @handle: User's handle for a drm_syncobj to signal. */
+    __u32 handle;
+
+    /** @rsvd: Reserved, MBZ */
+    __u32 rsvd;
+
+    /**
+ * @value: A point in the timeline.
+ * Value must be 0 for a binary drm_syncobj. A Value of 0 for a
+ * timeline drm_syncobj is invalid as it turns a drm_syncobj 
into a

+ * binary one.
+ */
+    __u64 value;
+};
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the 
mapping of GPU
+ * virtual address (VA) range to the section of an object that 
should be bound

+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) 
and can
+ * be mapped to whole object or a section of the object (partial 
binding).
+ * Multiple VA mappings can be created to the same section of the 
object

+ * (aliasing).
+ *
+ * The @start, @offset and @length should be 4K page aligned. 
However the DG2
+ * and XEHPSDV has 64K page size for device local-memory and has 
compact page
+ * table. On those platforms, for binding device local-memory 
objects, the
+ * @start should be 2M aligned, @offset and @length should be 64K 
aligned.


Should some error codes be documented and has the ability to 
programmatically probe the alignment restrictions been considered?




Currently what we have internally is that -EINVAL is returned if the 
sart, offset
and length are not aligned. If the specified mapping already exits, we 
return
-EEXIST. If there are conflicts in the VA range and VA range can't be 
reserved,
then -ENOSPC is returned. I can add this documentation here. But I am 
worried
that there will be more suggestions/feedback about error codes while 
reviewing

the code patch series, and we have to revisit it again.



That's not really a good excuse to not document.




+ * Also, on those platforms, it is not allowed to bind an device 
local-memory
+ * object and a system memory object in a single 2M section of VA 
range.


Text should be clear whether "not allowed" means there will be an 
error returned, or it will appear to work but bad things will happen.




Yah, error returned, will fix.


+ */
+struct drm_i915_gem_vm_bind {
+    /** @vm_id: VM (address space) id to bind */
+    __u32 vm_id;
+
+    /** @handle: Object handle */
+    __u32 handle;
+
+    /** @start: Virtual Address start to bind */
+    __u64 start;
+
+    /** @offset: Offset in object to bind */
+    __u64 offset;
+
+    /** @length: Length of mapping to bind */
+    __u64 length;
+
+    /**
+ * @flags: Supported flags are:
+ *
+ * I915_GEM_VM_BIND_FENCE_VALID:
+ * @fence is valid, needs bind completion

Re: [Intel-gfx] [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-23 Thread Lionel Landwerlin


On 23/06/2022 11:27, Tvrtko Ursulin wrote:


After a vm_unbind, UMD can re-bind to same VA range against an active 
VM.
Though I am not sue with Mesa usecase if that new mapping is required 
for

running GPU job or it will be for the next submission. But ensuring the
tlb flush upon unbind, KMD can ensure correctness.


Isn't that their problem? If they re-bind for submitting _new_ work 
then they get the flush as part of batch buffer pre-amble. 


In the non sparse case, if a VA range is unbound, it is invalid to use 
that range for anything until it has been rebound by something else.


We'll take the fence provided by vm_bind and put it as a wait fence on 
the next execbuffer.


It might be safer in case of memory over fetching?


TLB flush will have to happen at some point right?

What's the alternative to do it in unbind?


-Lionel

[Intel-gfx] [PATCH] drm/i915/dg2: Add performance workaround 18019455067

2022-06-22 Thread Lionel Landwerlin

This is the recommended value for optimal performance.

Signed-off-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h | 3 +++
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 07ef111947b8c..a50b5790e434e 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1112,6 +1112,9 @@
 #define   GEN12_PUSH_CONST_DEREF_HOLD_DIS  REG_BIT(8)
 
 #define RT_CTRL_MMIO(0xe530)
+#define   NUMBER_OF_STACKIDS_512   (2 << 5)
+#define   NUMBER_OF_STACKIDS_1024  (1 << 5)
+#define   NUMBER_OF_STACKIDS_2048  (0 << 5)
 #define   DIS_NULL_QUERY   REG_BIT(10)
 
 #define EU_PERF_CNTL1  _MMIO(0xe558)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 3213c593a55f4..a8a389d36986c 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2106,6 +2106,9 @@ rcs_engine_wa_init(struct intel_engine_cs *engine, struct 
i915_wa_list *wal)
 * performance guide section.
 */
wa_write_or(wal, XEHP_L3SCQREG7, BLEND_FILL_CACHING_OPT_DIS);
+
+/* Wa_18019455067:dg2 / BSpec 68331/54402 */
+wa_write_or(wal, RT_CTRL, NUMBER_OF_STACKIDS_512);
}
 
if (IS_DG2_GRAPHICS_STEP(i915, G11, STEP_A0, STEP_B0)) {
-- 
2.32.0

Re: [Intel-gfx] [PATCH v2 01/12] drm/doc: add rfc section for small BAR uapi

2022-06-21 Thread Lionel Landwerlin


On 21/06/2022 13:44, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)
v3:
   - Drop the vma query for now.
   - Add unallocated_cpu_visible_size as part of the region query.
   - Improve the docs some more, including documenting the expected
 behaviour on older kernels, since this came up in some offline
 discussion.
v4:
   - Various improvements all over. (Tvrtko)

v5:
   - Include newer integrated platforms when applying the non-recoverable
 context and error capture restriction. (Thomas)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Tvrtko Ursulin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
Acked-by: Tvrtko Ursulin 
Acked-by: Akeem G Abodunrin 



With Jordan with have changes for Anv/Iris : 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16739


Acked-by: Lionel Landwerlin 



---
  Documentation/gpu/rfc/i915_small_bar.h   | 189 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  47 ++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 240 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..752bb2ceb399
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,189 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @rsvd0: MBZ */
+   __u32 rsvd0;
+
+   /**
+* @probed_size: Memory probed by the driver (-1 = unknown)
+*
+* Note that it should not be possible to ever encounter a zero value
+* here, also note that no current region type will ever return -1 here.
+* Although for future region types, this might be a possibility. The
+* same applies to the other size fields.
+*/
+   __u64 probed_size;
+
+   /**
+* @unallocated_size: Estimate of memory remaining (-1 = unknown)
+*
+* Requires CAP_PERFMON or CAP_SYS_ADMIN to get reliable accounting.
+* Without this (or if this is an older kernel) the value here will
+* always equal the @probed_size. Note this is only currently tracked
+* for I915_MEMORY_CLASS_DEVICE regions (for other types the value here
+* will always equal the @probed_size).
+*/
+   __u64 unallocated_size;
+
+   union {
+   /** @rsvd1: MBZ */
+   __u64 rsvd1[8];
+   struct {
+   /**
+* @probed_cpu_visible_size: Memory probed by the driver
+* that is CPU accessible. (-1 = unknown).
+*
+* This will be always be <= @probed_size, and the
+* remainder (if there is any) will not be CPU
+* accessible.
+*
+* On systems without small BAR, the @probed_size will
+* always equal the @probed_cpu_visible_size, since all
+* of it will be CPU accessible.
+*
+* Note this is only tracked for
+* I915_MEMORY_CLASS_DEVICE regions (for other types the
+* value here will always equal the @probed_size).
+*
+* Note that if the value returned here is zero, then
+* this must be an old kernel which lacks the relevant
+* small-bar uAPI support (including
+* I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS), but on
+* such systems we should never actually end up with a
+* small BAR configuration, assuming we are able to load
+* the kernel module. Hence it should be safe to treat
+* this the same as when @probed_cpu_visible_size ==
+* @probed_size.
+*/
+   __u64 probed_cpu_visib

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-14 Thread Lionel Landwerlin


On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:

On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:



Regards,
Oak


-Original Message-
From: Intel-gfx  On Behalf 
Of Niranjana

Vishwanathapura
Sent: June 10, 2022 1:43 PM
To: Landwerlin, Lionel G 
Cc: Intel GFX ; Maling list - DRI 
developers de...@lists.freedesktop.org>; Hellstrom, Thomas 
;

Wilson, Chris P ; Vetter, Daniel
; Christian König 
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature 
design

document

On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
>On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
>>On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
>>>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
>>>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>>>>>  On 09/06/2022 00:55, Jason Ekstrand wrote:
>>>>>
>>>>>    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>>>>>  wrote:
>>>>>
>>>>>  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin 
wrote:

>>>>>  >
>>>>>  >
>>>>>  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>>>>  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
>>>>>Vishwanathapura
>>>>>  wrote:
>>>>>  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
>>>>>Ekstrand wrote:
>>>>>  >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana 
Vishwanathapura

>>>>>  >>>>  wrote:
>>>>>  >>>>
>>>>>  >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel
>>>>>Landwerlin
>>>>>      wrote:
>>>>>  >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>>  >>>>   >
>>>>>  >>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana
>>>>>Vishwanathapura
>>>>>  >>>>   >  wrote:
>>>>>  >>>>   >
>>>>>  >>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM -0700, 
Matthew

>>>>>  >>>>Brost wrote:
>>>>>  >>>>   >   >On Wed, Jun 01, 2022 at 05:25:49PM +0300, 
Lionel

>>>>>  Landwerlin
>>>>>  >>>>   wrote:
>>>>>  >>>>   > >> On 17/05/2022 21:32, Niranjana Vishwanathapura
>>>>>  wrote:
>>>>>  >>>>   > >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>>  >>>>   binding/unbinding
>>>>>  >>>>   >   the mapping in an
>>>>>  >>>>   > >> > +async worker. The binding and
>>>>>unbinding will
>>>>>  >>>>work like a
>>>>>  >>>>   special
>>>>>  >>>>   >   GPU engine.
>>>>>  >>>>   > >> > +The binding and unbinding operations are
>>>>>  serialized and
>>>>>  >>>>   will
>>>>>  >>>>   >   wait on specified
>>>>>  >>>>   > >> > +input fences before the operation
>>>>>and will signal
>>>>>  the
>>>>>  >>>>   output
>>>>>  >>>>   >   fences upon the
>>>>>  >>>>   > >> > +completion of the operation. Due to
>>>>>  serialization,
>>>>>  >>>>   completion of
>>>>>  >>>>   >   an operation
>>>>>  >>>>   > >> > +will also indicate that all
>>>>>previous operations
>>>>>  >>>>are also
>>>>>  >>>>   > complete.
>>>>>  >>>>   > >>
>>>>>  >>>>   > >> I guess we should avoid saying "will
>>>>>immediately
>>>>>  start
>>>>>  >>>>   > binding/unbinding" if
>>>>>  >>>>   > >> there are fences involved.
>>>>>  >>>>   > >>
>>>>>  >>>>   > >> And the fact that i

Re: [Intel-gfx] [PATCH 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-14 Thread Lionel Landwerlin


On 10/06/2022 11:53, Matthew Brost wrote:

On Fri, Jun 10, 2022 at 12:07:11AM -0700, Niranjana Vishwanathapura wrote:

VM_BIND and related uapi definitions

Signed-off-by: Niranjana Vishwanathapura 
---
  Documentation/gpu/rfc/i915_vm_bind.h | 490 +++
  1 file changed, 490 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index ..9fc854969cfb
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,490 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ * bit[0]: If set, VM_BIND is supported, otherwise not.
+ * bits[8-15]: VM_BIND implementation version.
+ * version 0 will not have VM_BIND/UNBIND timeline fence array support.
+ */
+#define I915_PARAM_HAS_VM_BIND 57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * The older execbuf2 ioctl will not support VM_BIND mode of operation.
+ * For VM_BIND mode, we have new execbuf3 ioctl which will not accept any
+ * execlist (See struct drm_i915_gem_execbuffer3 for more details).
+ *
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND   (1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_SIGNAL (See _i915_gem_exec_fence.flags) is expected
+ * to be not used. DRM_I915_GEM_WAIT ioctl call is also not supported for
+ * objects mapped to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND   0x3d
+#define DRM_I915_GEM_VM_UNBIND 0x3e
+#define DRM_I915_GEM_EXECBUFFER3   0x3f
+#define DRM_I915_GEM_WAIT_USER_FENCE   0x40
+
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND   DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_EXECBUFFER3 DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_EXECBUFFER3, struct drm_i915_gem_execbuffer3)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ *
+ * The @queue_idx specifies the queue to use for binding. Same queue can be
+ * used for both VM_BIND and VM_UNBIND calls. All submitted bind and unbind
+ * operations in a queue are performed in the order of submission.
+ *
+ * The @start, @offset and @length should be 4K page aligned. However the DG2
+ * and XEHPSDV has 64K page size for device local-memory and has compact page
+ * table. On those platforms, for binding device local-memory objects, the
+ * @start should be 2M aligned, @offset and @length should be 64K aligned.
+ * Also, on those platforms, it is not allowed to bind an device local-memory
+ * object and a system memory object in a single 2M section of VA range.
+ */
+struct drm_i915_gem_vm_bind {
+   /** @vm_id: VM (address space) id to bind */
+   __u32 vm_id;
+
+   /** @queue_idx: Index of queue for binding */
+   __u32 queue_idx;
+
+   /** @rsvd: Reserved, MBZ */
+   __u32 rsvd;
+
+   /** @handle: Object handle */
+   __u32 handle;
+
+   /** @start: Virtual Address start to bind */
+   __u64 start;
+
+   /** @offset: Offset in object to bind */
+   __u64 offset;
+
+   /** @length: Length of mapping to bind */
+   __u64 length;

This probably isn't needed. We are never going to unbind a subset of a
VMA are we? That being said it can't hurt as a sanity check (e.g.
internal

Re: [Intel-gfx] [PATCH 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-10 Thread Lionel Landwerlin


On 10/06/2022 13:37, Tvrtko Ursulin wrote:


On 10/06/2022 08:07, Niranjana Vishwanathapura wrote:

VM_BIND and related uapi definitions

Signed-off-by: Niranjana Vishwanathapura 


---
  Documentation/gpu/rfc/i915_vm_bind.h | 490 +++
  1 file changed, 490 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

new file mode 100644
index ..9fc854969cfb
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,490 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ * bit[0]: If set, VM_BIND is supported, otherwise not.
+ * bits[8-15]: VM_BIND implementation version.
+ * version 0 will not have VM_BIND/UNBIND timeline fence array support.
+ */
+#define I915_PARAM_HAS_VM_BIND    57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * The older execbuf2 ioctl will not support VM_BIND mode of operation.
+ * For VM_BIND mode, we have new execbuf3 ioctl which will not 
accept any

+ * execlist (See struct drm_i915_gem_execbuffer3 for more details).
+ *
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
+
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable 
amount of time.
+ * Compute on the other hand can be long running. Hence it is not 
appropriate

+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_SIGNAL (See _i915_gem_exec_fence.flags) is 
expected
+ * to be not used. DRM_I915_GEM_WAIT ioctl call is also not 
supported for

+ * objects mapped to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND    0x3d
+#define DRM_I915_GEM_VM_UNBIND    0x3e
+#define DRM_I915_GEM_EXECBUFFER3    0x3f
+#define DRM_I915_GEM_WAIT_USER_FENCE    0x40
+
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_EXECBUFFER3 DRM_IOWR(DRM_COMMAND_BASE + 
DRM_I915_GEM_EXECBUFFER3, struct drm_i915_gem_execbuffer3)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE 
+ DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)

+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the 
mapping of GPU
+ * virtual address (VA) range to the section of an object that 
should be bound

+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) 
and can
+ * be mapped to whole object or a section of the object (partial 
binding).
+ * Multiple VA mappings can be created to the same section of the 
object

+ * (aliasing).
+ *
+ * The @queue_idx specifies the queue to use for binding. Same queue 
can be
+ * used for both VM_BIND and VM_UNBIND calls. All submitted bind and 
unbind

+ * operations in a queue are performed in the order of submission.
+ *
+ * The @start, @offset and @length should be 4K page aligned. 
However the DG2
+ * and XEHPSDV has 64K page size for device local-memory and has 
compact page
+ * table. On those platforms, for binding device local-memory 
objects, the
+ * @start should be 2M aligned, @offset and @length should be 64K 
aligned.
+ * Also, on those platforms, it is not allowed to bind an device 
local-memory
+ * object and a system memory object in a single 2M section of VA 
range.

+ */
+struct drm_i915_gem_vm_bind {
+    /** @vm_id: VM (address space) id to bind */
+    __u32 vm_id;
+
+    /** @queue_idx: Index of queue for binding */
+    __u32 queue_idx;


I have a question here to which I did not find an answer by browsing 
the old threads.


Queue index appears to be an implicit synchronisation mechanism, 
right? Operations on the same index are executed/complete in order of 
ioctl submission?


Do we _have_ to implement this on the kernel side and could just allow 
in/out fence and let userspace deal with it?



It orders operations like in a queue. Which is kind of what happens with 
existing queues/engines.


If I understood correctly, it's going to be a kthread + a linked list right?

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-10 Thread Lionel Landwerlin


On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:

On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:

On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:

On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:

  On 09/06/2022 00:55, Jason Ekstrand wrote:

    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
     wrote:

  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
  >
  >
  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
Vishwanathapura

  wrote:
  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand 
wrote:

  >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
  >>>>  wrote:
  >>>>
      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel 
Landwerlin

  wrote:
  >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
  >>>>   >
  >>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
Vishwanathapura

  >>>>   >  wrote:
  >>>>   >
  >>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
  >>>>Brost wrote:
  >>>>   >   >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
  Landwerlin
  >>>>   wrote:
  >>>>   >   >> On 17/05/2022 21:32, Niranjana Vishwanathapura
  wrote:
  >>>>   >   >> > +VM_BIND/UNBIND ioctl will immediately start
  >>>>   binding/unbinding
  >>>>   >   the mapping in an
  >>>>   >   >> > +async worker. The binding and unbinding 
will

  >>>>work like a
  >>>>   special
  >>>>   >   GPU engine.
  >>>>   >   >> > +The binding and unbinding operations are
  serialized and
  >>>>   will
  >>>>   >   wait on specified
  >>>>   >   >> > +input fences before the operation and 
will signal

  the
  >>>>   output
  >>>>   >   fences upon the
  >>>>   >   >> > +completion of the operation. Due to
  serialization,
  >>>>   completion of
  >>>>   >   an operation
  >>>>   >   >> > +will also indicate that all previous 
operations

  >>>>are also
  >>>>   >   complete.
  >>>>   >   >>
  >>>>   >   >> I guess we should avoid saying "will 
immediately

  start
  >>>>   >   binding/unbinding" if
  >>>>   >   >> there are fences involved.
  >>>>   >   >>
  >>>>   >   >> And the fact that it's happening in an async
  >>>>worker seem to
  >>>>   imply
  >>>>   >   it's not
  >>>>   >   >> immediate.
  >>>>   >   >>
  >>>>   >
  >>>>   >   Ok, will fix.
  >>>>   >   This was added because in earlier design 
binding was

  deferred
  >>>>   until
  >>>>   >   next execbuff.
  >>>>   >   But now it is non-deferred (immediate in that 
sense).

  >>>>But yah,
  >>>>   this is
  >>>>   >   confusing
  >>>>   >   and will fix it.
  >>>>   >
  >>>>   >   >>
  >>>>   >   >> I have a question on the behavior of the bind
  >>>>operation when
  >>>>   no
  >>>>   >   input fence
  >>>>   >   >> is provided. Let say I do :
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence1)
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence2)
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence3)
  >>>>   >   >>
  >>>>   >   >>
  >>>>   >   >> In what order are the fences going to be 
signaled?

  >>>>   >   >>
  >>>>   >   >> In the order of VM_BIND ioctls? Or out of 
order?

  >>>>   >   >&

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-10 Thread Lionel Landwerlin


On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:

On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:

  On 09/06/2022 00:55, Jason Ekstrand wrote:

    On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
     wrote:

  On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
  >
  >
  >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
  >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana 
Vishwanathapura

  wrote:
  >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
  >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
  >>>>  wrote:
  >>>>
      >>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
  wrote:
  >>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
  >>>>   >
  >>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana 
Vishwanathapura

  >>>>   >  wrote:
  >>>>   >
  >>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
  >>>>Brost wrote:
  >>>>   >   >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
  Landwerlin
  >>>>   wrote:
  >>>>   >   >> On 17/05/2022 21:32, Niranjana Vishwanathapura
  wrote:
  >>>>   >   >> > +VM_BIND/UNBIND ioctl will immediately start
  >>>>   binding/unbinding
  >>>>   >   the mapping in an
  >>>>   >   >> > +async worker. The binding and unbinding will
  >>>>work like a
  >>>>   special
  >>>>   >   GPU engine.
  >>>>   >   >> > +The binding and unbinding operations are
  serialized and
  >>>>   will
  >>>>   >   wait on specified
  >>>>   >   >> > +input fences before the operation and will 
signal

  the
  >>>>   output
  >>>>   >   fences upon the
  >>>>   >   >> > +completion of the operation. Due to
  serialization,
  >>>>   completion of
  >>>>   >   an operation
  >>>>   >   >> > +will also indicate that all previous 
operations

  >>>>are also
  >>>>   >   complete.
  >>>>   >   >>
  >>>>   >   >> I guess we should avoid saying "will immediately
  start
  >>>>   >   binding/unbinding" if
  >>>>   >   >> there are fences involved.
  >>>>   >   >>
  >>>>   >   >> And the fact that it's happening in an async
  >>>>worker seem to
  >>>>   imply
  >>>>   >   it's not
  >>>>   >   >> immediate.
  >>>>   >   >>
  >>>>   >
  >>>>   >   Ok, will fix.
  >>>>   >   This was added because in earlier design binding 
was

  deferred
  >>>>   until
  >>>>   >   next execbuff.
  >>>>   >   But now it is non-deferred (immediate in that 
sense).

  >>>>But yah,
  >>>>   this is
  >>>>   >   confusing
  >>>>   >   and will fix it.
  >>>>   >
  >>>>   >   >>
  >>>>   >   >> I have a question on the behavior of the bind
  >>>>operation when
  >>>>   no
  >>>>   >   input fence
  >>>>   >   >> is provided. Let say I do :
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence1)
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence2)
  >>>>   >   >>
  >>>>   >   >> VM_BIND (out_fence=fence3)
  >>>>   >   >>
  >>>>   >   >>
  >>>>   >   >> In what order are the fences going to be 
signaled?

  >>>>   >   >>
  >>>>   >   >> In the order of VM_BIND ioctls? Or out of order?
  >>>>   >   >>
  >>>>   >   >> Because you wrote "serialized I assume it's : in
  order
  >>>>

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-09 Thread Lionel Landwerlin


On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura 
 wrote:


On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>
>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
Vishwanathapura wrote:
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>>>  wrote:
>>>>
>>>>   On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin
wrote:
>>>>   >   On 02/06/2022 23:35, Jason Ekstrand wrote:
>>>>   >
>>>>   > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>>>   >  wrote:
>>>>   >
>>>>   >   On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
>>>>Brost wrote:
>>>>   >   >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel
Landwerlin
>>>>   wrote:
>>>>   >   >> On 17/05/2022 21:32, Niranjana Vishwanathapura
wrote:
>>>>   >   >> > +VM_BIND/UNBIND ioctl will immediately start
>>>>   binding/unbinding
>>>>   >   the mapping in an
>>>>   >   >> > +async worker. The binding and unbinding will
>>>>work like a
>>>>   special
>>>>   >   GPU engine.
>>>>   >   >> > +The binding and unbinding operations are
serialized and
>>>>   will
>>>>   >   wait on specified
>>>>   >   >> > +input fences before the operation and will
signal the
>>>>   output
>>>>   >   fences upon the
>>>>   >   >> > +completion of the operation. Due to
serialization,
>>>>   completion of
>>>>   >   an operation
>>>>   >   >> > +will also indicate that all previous operations
>>>>are also
>>>>   >   complete.
>>>>   >   >>
>>>>   >   >> I guess we should avoid saying "will immediately
start
>>>>   >   binding/unbinding" if
>>>>   >   >> there are fences involved.
>>>>   >   >>
>>>>   >   >> And the fact that it's happening in an async
>>>>worker seem to
>>>>   imply
>>>>   >   it's not
>>>>   >   >> immediate.
>>>>   >   >>
>>>>   >
>>>>   >   Ok, will fix.
>>>>   >   This was added because in earlier design binding
was deferred
>>>>   until
>>>>   >   next execbuff.
>>>>   >   But now it is non-deferred (immediate in that sense).
>>>>But yah,
>>>>   this is
>>>>   >   confusing
>>>>   >   and will fix it.
>>>>   >
>>>>   >   >>
>>>>   >   >> I have a question on the behavior of the bind
>>>>operation when
>>>>   no
>>>>   >   input fence
>>>>   >   >> is provided. Let say I do :
>>>>   >   >>
>>>>   >   >> VM_BIND (out_fence=fence1)
>>>>   >   >>
>>>>   >   >> VM_BIND (out_fence=fence2)
>>>>   >   >>
>>>>   >   >> VM_BIND (out_fence=fence3)
>>>>   >   >>
>>>>   >   >>
>>>>   >   >> In what order are the fences going to be signaled?
>>>>   >   >>
>>>>   >   >> In the order of VM_BIND ioctls? Or out of order?
>>>>   >   >>
>>>>   >   >> Because you wrote "serialized I assume it's : in
order
>>>>   >   >>
>>>>   >
>>>>   >   Yes, in the order of VM_BIND/UNBIND ioctls. Note that
>>>>bind and
>>>>   unbind
>>>>   >   will use
>>>>   >   the same queue and hence are ordered.
>&

Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-08 Thread Lionel Landwerlin


On 08/06/2022 11:36, Tvrtko Ursulin wrote:


On 08/06/2022 07:40, Lionel Landwerlin wrote:

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
wrote:

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:

On Wed, 1 Jun 2022 at 11:03, Dave Airlie  wrote:


On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
 wrote:


On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura 
wrote:

>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>> Also add new uapi and documentation as per review comments
>> from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura 


>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
+++

>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

>> new file mode 100644
>> index ..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND 57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM 
creation.

>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff 
mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any 
execlist (ie., the

>> + * _i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, _i915_gem_execbuffer2.batch_start_offset and
>> + * _i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension 
must be provided

>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of _i915_gem_execbuffer2.flags 
must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
must always be
>> + * set (See struct 
drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and 
batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
must be 0.

>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>    __u64 buffers_ptr;  -> must be 0 (new)
>    __u32 buffer_count; -> must be 0 (new)
>    __u32 batch_start_offset;   -> must be 0 (new)
>    __u32 batch_len;    -> must be 0 (new)
>    __u32 DR1;  -> must be 0 (old)
>    __u32 DR4;  -> must be 0 (old)
>    __u32 num_cliprects; (fences)   -> must be 0 since 
using extensions
>    __u64 cliprects_ptr; (fences, extensions) -> contains 
an actual pointer!
>    __u64 flags;    -> some flags must be 0 
(new)

>    __u64 rsvd1; (context info) -> repurposed field (old)
>    __u64 rsvd2;    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 
instead
>of adding even more complexity to an already abused interface? 
While

>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how 
the VM

>was created (which is an entirely different ioctl).
>
>From Rusty Russel's API Design grading, 
drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind 
we'll need

>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natual extensions).
Adding a new execbuffer3 was considered, but I think we need to 
be careful
with that as that goes beyond the VM_BIND support, including any 
future

requirements (as we don't want an execbuffer4 after VM_BIND).


Why not? it's not like adding extensions here is really that 
different

than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just  to burn down the old
requirements and pointless fi

Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-08 Thread Lionel Landwerlin


On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
wrote:

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:

On Wed, 1 Jun 2022 at 11:03, Dave Airlie  wrote:


On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
 wrote:


On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>> Also add new uapi and documentation as per review comments
>> from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura 


>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
+++

>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

>> new file mode 100644
>> index ..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND   57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff 
mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any 
execlist (ie., the

>> + * _i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, _i915_gem_execbuffer2.batch_start_offset and
>> + * _i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
be provided

>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of _i915_gem_execbuffer2.flags 
must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
must always be

>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and 
batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
must be 0.

>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>    __u64 buffers_ptr;  -> must be 0 (new)
>    __u32 buffer_count; -> must be 0 (new)
>    __u32 batch_start_offset;   -> must be 0 (new)
>    __u32 batch_len;    -> must be 0 (new)
>    __u32 DR1;  -> must be 0 (old)
>    __u32 DR4;  -> must be 0 (old)
>    __u32 num_cliprects; (fences)   -> must be 0 since using 
extensions
>    __u64 cliprects_ptr; (fences, extensions) -> contains an 
actual pointer!
>    __u64 flags;    -> some flags must be 0 
(new)

>    __u64 rsvd1; (context info) -> repurposed field (old)
>    __u64 rsvd2;    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 
instead

>of adding even more complexity to an already abused interface? While
>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how 
the VM

>was created (which is an entirely different ioctl).
>
>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind we'll 
need

>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natual extensions).
Adding a new execbuffer3 was considered, but I think we need to be 
careful
with that as that goes beyond the VM_BIND support, including any 
future

requirements (as we don't want an execbuffer4 after VM_BIND).


Why not? it's not like adding extensions here is really that different
than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just  to burn down the old
requirements and pointless fields.

Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
older sw on execbuf2 for ever.


I guess another point in favour of execbuf3 would be that it's less
midlayer. If we share the entry point then there's quite a few vfuncs
needed to cleanly split out the vm_bind paths from the legacy
reloc/softping paths.

If we invert this and do execbuf3, then there's the existing ioctl
vfunc, and then we share code (where it even makes sense, probably
request setup/submit need to be shared, anything else is

Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-08 Thread Lionel Landwerlin


On 08/06/2022 09:40, Lionel Landwerlin wrote:

On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
wrote:

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:

On Wed, 1 Jun 2022 at 11:03, Dave Airlie  wrote:


On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
 wrote:


On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>> Also add new uapi and documentation as per review comments
>> from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura 


>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
+++

>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

>> new file mode 100644
>> index ..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND 57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM 
creation.

>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff 
mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any 
execlist (ie., the

>> + * _i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, _i915_gem_execbuffer2.batch_start_offset and
>> + * _i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
be provided

>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of _i915_gem_execbuffer2.flags 
must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
must always be

>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and 
batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
must be 0.

>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>    __u64 buffers_ptr;  -> must be 0 (new)
>    __u32 buffer_count; -> must be 0 (new)
>    __u32 batch_start_offset;   -> must be 0 (new)
>    __u32 batch_len;    -> must be 0 (new)
>    __u32 DR1;  -> must be 0 (old)
>    __u32 DR4;  -> must be 0 (old)
>    __u32 num_cliprects; (fences)   -> must be 0 since using 
extensions
>    __u64 cliprects_ptr; (fences, extensions) -> contains an 
actual pointer!
>    __u64 flags;    -> some flags must be 0 
(new)

>    __u64 rsvd1; (context info) -> repurposed field (old)
>    __u64 rsvd2;    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 
instead
>of adding even more complexity to an already abused interface? 
While

>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how 
the VM

>was created (which is an entirely different ioctl).
>
>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind we'll 
need

>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natual extensions).
Adding a new execbuffer3 was considered, but I think we need to 
be careful
with that as that goes beyond the VM_BIND support, including any 
future

requirements (as we don't want an execbuffer4 after VM_BIND).


Why not? it's not like adding extensions here is really that 
different

than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just  to burn down the old
requirements and pointless fields.

Make execbuffer3 be vm bind only, no relocs,

Re: [Intel-gfx] [RFC v3 3/3] drm/doc/rfc: VM_BIND uapi definition

2022-06-08 Thread Lionel Landwerlin


On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura 
wrote:

On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:

On Wed, 1 Jun 2022 at 11:03, Dave Airlie  wrote:


On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura
 wrote:


On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
>On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
>> VM_BIND and related uapi definitions
>>
>> v2: Ensure proper kernel-doc formatting with cross references.
>> Also add new uapi and documentation as per review comments
>> from Daniel.
>>
>> Signed-off-by: Niranjana Vishwanathapura 


>> ---
>>  Documentation/gpu/rfc/i915_vm_bind.h | 399 
+++

>>  1 file changed, 399 insertions(+)
>>  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>
>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

>> new file mode 100644
>> index ..589c0a009107
>> --- /dev/null
>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>> @@ -0,0 +1,399 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +/**
>> + * DOC: I915_PARAM_HAS_VM_BIND
>> + *
>> + * VM_BIND feature availability.
>> + * See typedef drm_i915_getparam_t param.
>> + */
>> +#define I915_PARAM_HAS_VM_BIND   57
>> +
>> +/**
>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>> + *
>> + * Flag to opt-in for VM_BIND mode of binding during VM creation.
>> + * See struct drm_i915_gem_vm_control flags.
>> + *
>> + * A VM in VM_BIND mode will not support the older execbuff 
mode of binding.
>> + * In VM_BIND mode, execbuff ioctl will not accept any 
execlist (ie., the

>> + * _i915_gem_execbuffer2.buffer_count must be 0).
>> + * Also, _i915_gem_execbuffer2.batch_start_offset and
>> + * _i915_gem_execbuffer2.batch_len must be 0.
>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must 
be provided

>> + * to pass in the batch buffer addresses.
>> + *
>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
>> + * I915_EXEC_BATCH_FIRST of _i915_gem_execbuffer2.flags 
must be 0
>> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag 
must always be

>> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
>> + * The buffers_ptr, buffer_count, batch_start_offset and 
batch_len fields
>> + * of struct drm_i915_gem_execbuffer2 are also not used and 
must be 0.

>> + */
>
>From that description, it seems we have:
>
>struct drm_i915_gem_execbuffer2 {
>    __u64 buffers_ptr;  -> must be 0 (new)
>    __u32 buffer_count; -> must be 0 (new)
>    __u32 batch_start_offset;   -> must be 0 (new)
>    __u32 batch_len;    -> must be 0 (new)
>    __u32 DR1;  -> must be 0 (old)
>    __u32 DR4;  -> must be 0 (old)
>    __u32 num_cliprects; (fences)   -> must be 0 since using 
extensions
>    __u64 cliprects_ptr; (fences, extensions) -> contains an 
actual pointer!
>    __u64 flags;    -> some flags must be 0 
(new)

>    __u64 rsvd1; (context info) -> repurposed field (old)
>    __u64 rsvd2;    -> unused
>};
>
>Based on that, why can't we just get drm_i915_gem_execbuffer3 
instead

>of adding even more complexity to an already abused interface? While
>the Vulkan-like extension thing is really nice, I don't think what
>we're doing here is extending the ioctl usage, we're completely
>changing how the base struct should be interpreted based on how 
the VM

>was created (which is an entirely different ioctl).
>
>From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is
>already at -6 without these changes. I think after vm_bind we'll 
need

>to create a -11 entry just to deal with this ioctl.
>

The only change here is removing the execlist support for VM_BIND
mode (other than natual extensions).
Adding a new execbuffer3 was considered, but I think we need to be 
careful
with that as that goes beyond the VM_BIND support, including any 
future

requirements (as we don't want an execbuffer4 after VM_BIND).


Why not? it's not like adding extensions here is really that different
than adding new ioctls.

I definitely think this deserves an execbuffer3 without even
considering future requirements. Just  to burn down the old
requirements and pointless fields.

Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the
older sw on execbuf2 for ever.


I guess another point in favour of execbuf3 would be that it's less
midlayer. If we share the entry point then there's quite a few vfuncs
needed to cleanly split out the vm_bind paths from the legacy
reloc/softping paths.

If we invert this and do execbuf3, then there's the existing ioctl
vfunc, and then we share code (where it even makes sense, probably
request setup/submit need to be shared, anything else is

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-03 Thread Lionel Landwerlin

On 02/06/2022 23:35, Jason Ekstrand wrote:
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura 
 wrote:

On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > +VM_BIND/UNBIND ioctl will immediately start
binding/unbinding the mapping in an
>> > +async worker. The binding and unbinding will work like a
special GPU engine.
>> > +The binding and unbinding operations are serialized and will
wait on specified
>> > +input fences before the operation and will signal the output
fences upon the
>> > +completion of the operation. Due to serialization,
completion of an operation
>> > +will also indicate that all previous operations are also
complete.
>>
>> I guess we should avoid saying "will immediately start
binding/unbinding" if
>> there are fences involved.
>>
>> And the fact that it's happening in an async worker seem to
imply it's not
>> immediate.
>>

Ok, will fix.
This was added because in earlier design binding was deferred
until next execbuff.
But now it is non-deferred (immediate in that sense). But yah,
this is confusing
and will fix it.

>>
>> I have a question on the behavior of the bind operation when no
input fence
>> is provided. Let say I do :
>>
>> VM_BIND (out_fence=fence1)
>>
>> VM_BIND (out_fence=fence2)
>>
>> VM_BIND (out_fence=fence3)
>>
>>
>> In what order are the fences going to be signaled?
>>
>> In the order of VM_BIND ioctls? Or out of order?
>>
>> Because you wrote "serialized I assume it's : in order
>>

Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
unbind will use
the same queue and hence are ordered.

>>
>> One thing I didn't realize is that because we only get one
"VM_BIND" engine,
>> there is a disconnect from the Vulkan specification.
>>
>> In Vulkan VM_BIND operations are serialized but per engine.
>>
>> So you could have something like this :
>>
>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>
>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>
>>
>> fence1 is not signaled
>>
>> fence3 is signaled
>>
>> So the second VM_BIND will proceed before the first VM_BIND.
>>
>>
>> I guess we can deal with that scenario in userspace by doing
the wait
>> ourselves in one thread per engines.
>>
>> But then it makes the VM_BIND input fences useless.
>>
>>
>> Daniel : what do you think? Should be rework this or just deal
with wait
>> fences in userspace?
>>
>
>My opinion is rework this but make the ordering via an engine
param optional.
>
>e.g. A VM can be configured so all binds are ordered within the VM
>
>e.g. A VM can be configured so all binds accept an engine
argument (in
>the case of the i915 likely this is a gem context handle) and binds
>ordered with respect to that engine.
>
>This gives UMDs options as the later likely consumes more KMD
resources
>so if a different UMD can live with binds being ordered within the VM
>they can use a mode consuming less resources.
>

I think we need to be careful here if we are looking for some out of
(submission) order completion of vm_bind/unbind.
In-order completion means, in a batch of binds and unbinds to be
completed in-order, user only needs to specify in-fence for the
first bind/unbind call and the our-fence for the last bind/unbind
call. Also, the VA released by an unbind call can be re-used by
any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were to be allowed to
go out of order (of submission) and user need to be extra careful
not to run into pre-mature triggereing of out-fence and bind failing
as VA is still in use etc.

Also, VM_BIND binds the provided mapping on the specified address
space
(VM). So, the uapi is not engine/context specific.

We can however add a 'queue' to the uapi which can be one from the
pre-defined queues,
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

KMD will spawn an async work queue for each queue which will only
bind the mappings on that queue in

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-01 Thread Lionel Landwerlin


On 02/06/2022 00:18, Matthew Brost wrote:

On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:

On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:

+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
+async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.

I guess we should avoid saying "will immediately start binding/unbinding" if
there are fences involved.

And the fact that it's happening in an async worker seem to imply it's not
immediate.


I have a question on the behavior of the bind operation when no input fence
is provided. Let say I do :

VM_BIND (out_fence=fence1)

VM_BIND (out_fence=fence2)

VM_BIND (out_fence=fence3)


In what order are the fences going to be signaled?

In the order of VM_BIND ioctls? Or out of order?

Because you wrote "serialized I assume it's : in order


One thing I didn't realize is that because we only get one "VM_BIND" engine,
there is a disconnect from the Vulkan specification.

In Vulkan VM_BIND operations are serialized but per engine.

So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


Question - let's say this done after the above operations:

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)

Is the exec ordered with respected to bind (i.e. would fence3 & 4 be
signaled before the exec starts)?

Matt



Hi Matt,

From the vulkan point of view, everything is serialized within an 
engine (we map that to a VkQueue).


So with :

EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

EXEC completes first then VM_BIND executes.


To be even clearer :

EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


EXEC will wait until fence2 is signaled.
Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, 
VM_BIND executes.

It would kind of like having the VM_BIND operation be another batch executed 
from the ringbuffer buffer.

-Lionel





fence1 is not signaled

fence3 is signaled

So the second VM_BIND will proceed before the first VM_BIND.


I guess we can deal with that scenario in userspace by doing the wait
ourselves in one thread per engines.

But then it makes the VM_BIND input fences useless.


Daniel : what do you think? Should be rework this or just deal with wait
fences in userspace?


Sorry I noticed this late.


-Lionel

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-06-01 Thread Lionel Landwerlin


On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:

+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
+async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.


I guess we should avoid saying "will immediately start 
binding/unbinding" if there are fences involved.


And the fact that it's happening in an async worker seem to imply it's 
not immediate.



I have a question on the behavior of the bind operation when no input 
fence is provided. Let say I do :


VM_BIND (out_fence=fence1)

VM_BIND (out_fence=fence2)

VM_BIND (out_fence=fence3)


In what order are the fences going to be signaled?

In the order of VM_BIND ioctls? Or out of order?

Because you wrote "serialized I assume it's : in order


One thing I didn't realize is that because we only get one "VM_BIND" 
engine, there is a disconnect from the Vulkan specification.


In Vulkan VM_BIND operations are serialized but per engine.

So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)

VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)


fence1 is not signaled

fence3 is signaled

So the second VM_BIND will proceed before the first VM_BIND.


I guess we can deal with that scenario in userspace by doing the wait 
ourselves in one thread per engines.


But then it makes the VM_BIND input fences useless.


Daniel : what do you think? Should be rework this or just deal with wait 
fences in userspace?



Sorry I noticed this late.


-Lionel

Re: [Intel-gfx] [PATCH v2 2/6] drm/i915/xehp: Drop GETPARAM lookups of I915_PARAM_[SUB]SLICE_MASK

2022-06-01 Thread Lionel Landwerlin


On 17/05/2022 06:20, Matt Roper wrote:

Slice/subslice/EU information should be obtained via the topology
queries provided by the I915_QUERY interface; let's turn off support for
the old GETPARAM lookups on Xe_HP and beyond where we can't return
meaningful values.

The slice mask lookup is meaningless since Xe_HP doesn't support
traditional slices (and we make no attempt to return the various new
units like gslices, cslices, mslices, etc.) here.

The subslice mask lookup is even more problematic; given the distinct
masks for geometry vs compute purposes, the combined mask returned here
is likely not what userspace would want to act upon anyway.  The value
is also limited to 32-bits by the nature of the GETPARAM ioctl which is
sufficient for the initial Xe_HP platforms, but is unable to convey the
larger masks that will be needed on other upcoming platforms.  Finally,
the value returned here becomes even less meaningful when used on
multi-tile platforms where each tile will have its own masks.

Signed-off-by: Matt Roper 



Sounds fair. We've been relying on the topology query in Mesa since it's 
available and it's a requirement for Gfx10+.


FYI, we're also not using I915_PARAM_EU_TOTAL on Gfx10+ for the same reason.


Acked-by: Lionel Landwerlin 



---
  drivers/gpu/drm/i915/i915_getparam.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_getparam.c 
b/drivers/gpu/drm/i915/i915_getparam.c
index c12a0adefda5..ac9767c56619 100644
--- a/drivers/gpu/drm/i915/i915_getparam.c
+++ b/drivers/gpu/drm/i915/i915_getparam.c
@@ -148,11 +148,19 @@ int i915_getparam_ioctl(struct drm_device *dev, void 
*data,
value = intel_engines_has_context_isolation(i915);
break;
case I915_PARAM_SLICE_MASK:
+   /* Not supported from Xe_HP onward; use topology queries */
+   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
+   return -EINVAL;
+
value = sseu->slice_mask;
if (!value)
return -ENODEV;
break;
case I915_PARAM_SUBSLICE_MASK:
+   /* Not supported from Xe_HP onward; use topology queries */
+   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
+   return -EINVAL;
+
/* Only copy bits from the first slice */
memcpy(, sseu->subslice_mask,
   min(sseu->ss_stride, (u8)sizeof(value)));

Re: [Intel-gfx] [PATCH] drm/syncobj: flatten dma_fence_chains on transfer

2022-05-30 Thread Lionel Landwerlin


On 30/05/2022 14:40, Christian König wrote:

Am 30.05.22 um 12:09 schrieb Lionel Landwerlin:

On 30/05/2022 12:52, Christian König wrote:

Am 25.05.22 um 23:59 schrieb Lucas De Marchi:

On Wed, May 25, 2022 at 12:38:51PM +0200, Christian König wrote:

Am 25.05.22 um 11:35 schrieb Lionel Landwerlin:

[SNIP]

Err... Let's double check with my colleagues.

It seems we're running into a test failure in IGT with this 
patch, but now I have doubts that it's where the problem lies.


Yeah, exactly that's what I couldn't understand as well.

What you describe above should still work fine.

Thanks for taking a look into this,
Christian.


With some additional prints:

[  210.742634] Console: switching to colour dummy device 80x25
[  210.742686] [IGT] syncobj_timeline: executing
[  210.756988] [IGT] syncobj_timeline: starting subtest 
transfer-timeline-point
[  210.757364] [drm:drm_syncobj_transfer_ioctl] *ERROR* adding 
fence0 signaled=1
[  210.764543] [drm:drm_syncobj_transfer_ioctl] *ERROR* resulting 
array fence signaled=0

[  210.800469] [IGT] syncobj_timeline: exiting, ret=98
[  210.825426] Console: switching to colour frame buffer device 240x67


still learning this part of the code but AFAICS the problem is because
when we are creating the array, the 'signaled' doesn't propagate to 
the

array.


Yeah, but that is intentionally. The array should only signal when 
requested.


I still don't get what the test case here is checking.



There must be something I don't know about fence arrays.

You seem to say that creating an array of signaled fences will not 
make the array signaled.


Exactly that, yes. The array delays it's signaling until somebody asks 
for it.


In other words the fences inside the array are check only after 
someone calls dma_fence_enable_sw_signaling() which in turn calls 
dma_fence_array_enable_signaling().


It is certainly possible that nobody does that in the drm_syncobj and 
because of this the array never signals.


Regards,
Christian.



Thanks,


Yeah I guess dma_fence_enable_sw_signaling() is never called for sw_sync.

Don't we also want to call it right at the end of 
drm_syncobj_flatten_chain() ?



-Lionel







This is the situation with this IGT test.

We started with a syncobj with point 1 & 2 signaled.

We take point 2 and import it as a new point 3 on the same syncobj.

We expect point 3 to be signaled as well and it's not.


Thanks,


-Lionel




Regards,
Christian.



dma_fence_array_create() {
...
atomic_set(>num_pending, signal_on_any ? 1 : num_fences);
...
}

This is not considering the fact that some of the fences could already
have been signaled as is the case in the 
igt@syncobj_timeline@transfer-timeline-point
test. See 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11693/shard-dg1-12/igt@syncobj_timel...@transfer-timeline-point.html


Quick patch on this function fixes it for me:

-8<
Subject: [PATCH] dma-buf: Honor already signaled fences on array 
creation


When creating an array, array->num_pending is marked with the 
number of

fences. However the fences could alredy have been signaled. Propagate
num_pending to the array by looking at each individual fence the array
contains.

Signed-off-by: Lucas De Marchi 
---
 drivers/dma-buf/dma-fence-array.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-fence-array.c 
b/drivers/dma-buf/dma-fence-array.c

index 5c8a7084577b..32f491c32fa0 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -158,6 +158,8 @@ struct dma_fence_array 
*dma_fence_array_create(int num_fences,

 {
 struct dma_fence_array *array;
 size_t size = sizeof(*array);
+    unsigned num_pending = 0;
+    struct dma_fence **f;

 WARN_ON(!num_fences || !fences);

@@ -173,7 +175,14 @@ struct dma_fence_array 
*dma_fence_array_create(int num_fences,

 init_irq_work(>work, irq_dma_fence_array_work);

 array->num_fences = num_fences;
-    atomic_set(>num_pending, signal_on_any ? 1 : num_fences);
+
+    for (f = fences; f < fences + num_fences; f++)
+    num_pending += !dma_fence_is_signaled(*f);
+
+    if (signal_on_any)
+    num_pending = !!num_pending;
+
+    atomic_set(>num_pending, num_pending);
 array->fences = fences;

 array->base.error = PENDING_ERROR;

Re: [Intel-gfx] [PATCH] drm/syncobj: flatten dma_fence_chains on transfer

2022-05-30 Thread Lionel Landwerlin


On 30/05/2022 12:52, Christian König wrote:

Am 25.05.22 um 23:59 schrieb Lucas De Marchi:

On Wed, May 25, 2022 at 12:38:51PM +0200, Christian König wrote:

Am 25.05.22 um 11:35 schrieb Lionel Landwerlin:

[SNIP]

Err... Let's double check with my colleagues.

It seems we're running into a test failure in IGT with this patch, 
but now I have doubts that it's where the problem lies.


Yeah, exactly that's what I couldn't understand as well.

What you describe above should still work fine.

Thanks for taking a look into this,
Christian.


With some additional prints:

[  210.742634] Console: switching to colour dummy device 80x25
[  210.742686] [IGT] syncobj_timeline: executing
[  210.756988] [IGT] syncobj_timeline: starting subtest 
transfer-timeline-point
[  210.757364] [drm:drm_syncobj_transfer_ioctl] *ERROR* adding fence0 
signaled=1
[  210.764543] [drm:drm_syncobj_transfer_ioctl] *ERROR* resulting 
array fence signaled=0

[  210.800469] [IGT] syncobj_timeline: exiting, ret=98
[  210.825426] Console: switching to colour frame buffer device 240x67


still learning this part of the code but AFAICS the problem is because
when we are creating the array, the 'signaled' doesn't propagate to the
array.


Yeah, but that is intentionally. The array should only signal when 
requested.


I still don't get what the test case here is checking.



There must be something I don't know about fence arrays.

You seem to say that creating an array of signaled fences will not make 
the array signaled.



This is the situation with this IGT test.

We started with a syncobj with point 1 & 2 signaled.

We take point 2 and import it as a new point 3 on the same syncobj.

We expect point 3 to be signaled as well and it's not.


Thanks,


-Lionel




Regards,
Christian.



dma_fence_array_create() {
...
atomic_set(>num_pending, signal_on_any ? 1 : num_fences);
...
}

This is not considering the fact that some of the fences could already
have been signaled as is the case in the 
igt@syncobj_timeline@transfer-timeline-point
test. See 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11693/shard-dg1-12/igt@syncobj_timel...@transfer-timeline-point.html


Quick patch on this function fixes it for me:

-8<
Subject: [PATCH] dma-buf: Honor already signaled fences on array 
creation


When creating an array, array->num_pending is marked with the number of
fences. However the fences could alredy have been signaled. Propagate
num_pending to the array by looking at each individual fence the array
contains.

Signed-off-by: Lucas De Marchi 
---
 drivers/dma-buf/dma-fence-array.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-fence-array.c 
b/drivers/dma-buf/dma-fence-array.c

index 5c8a7084577b..32f491c32fa0 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -158,6 +158,8 @@ struct dma_fence_array 
*dma_fence_array_create(int num_fences,

 {
 struct dma_fence_array *array;
 size_t size = sizeof(*array);
+    unsigned num_pending = 0;
+    struct dma_fence **f;

 WARN_ON(!num_fences || !fences);

@@ -173,7 +175,14 @@ struct dma_fence_array 
*dma_fence_array_create(int num_fences,

 init_irq_work(>work, irq_dma_fence_array_work);

 array->num_fences = num_fences;
-    atomic_set(>num_pending, signal_on_any ? 1 : num_fences);
+
+    for (f = fences; f < fences + num_fences; f++)
+    num_pending += !dma_fence_is_signaled(*f);
+
+    if (signal_on_any)
+    num_pending = !!num_pending;
+
+    atomic_set(>num_pending, num_pending);
 array->fences = fences;

 array->base.error = PENDING_ERROR;

Re: [Intel-gfx] [PATCH] drm/syncobj: flatten dma_fence_chains on transfer

2022-05-25 Thread Lionel Landwerlin


On 25/05/2022 12:26, Lionel Landwerlin wrote:

On 25/05/2022 11:24, Christian König wrote:

Am 25.05.22 um 08:47 schrieb Lionel Landwerlin:

On 09/02/2022 20:26, Christian König wrote:

It is illegal to add a dma_fence_chain as timeline point. Flatten out
the fences into a dma_fence_array instead.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/drm_syncobj.c | 61 
---

  1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c 
b/drivers/gpu/drm/drm_syncobj.c

index c313a5b4549c..7e48dcd1bee4 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -853,12 +853,57 @@ drm_syncobj_fd_to_handle_ioctl(struct 
drm_device *dev, void *data,

  >handle);
  }
  +
+/*
+ * Try to flatten a dma_fence_chain into a dma_fence_array so that 
it can be

+ * added as timeline fence to a chain again.
+ */
+static int drm_syncobj_flatten_chain(struct dma_fence **f)
+{
+    struct dma_fence_chain *chain = to_dma_fence_chain(*f);
+    struct dma_fence *tmp, **fences;
+    struct dma_fence_array *array;
+    unsigned int count;
+
+    if (!chain)
+    return 0;
+
+    count = 0;
+    dma_fence_chain_for_each(tmp, >base)
+    ++count;
+
+    fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
+    if (!fences)
+    return -ENOMEM;
+
+    count = 0;
+    dma_fence_chain_for_each(tmp, >base)
+    fences[count++] = dma_fence_get(tmp);
+
+    array = dma_fence_array_create(count, fences,
+   dma_fence_context_alloc(1),



Hi Christian,


Sorry for the late answer to this.


It appears this commit is trying to remove the warnings added by 
"dma-buf: Warn about dma_fence_chain container rules"


Yes, correct. We are now enforcing some rules with warnings and this 
here bubbled up.




But the context allocation you added just above is breaking some 
tests. In particular igt@syncobj_timeline@transfer-timeline-point


That test transfer points into the timeline at point 3 and expects 
that we'll still on the previous points to complete.


Hui what? I don't understand the problem you are seeing here. What 
exactly is the test doing?




In my opinion we should be reusing the previous context number if 
there is one and only allocate if we don't have a point.


Scratching my head what you mean with that. The functionality 
transfers a synchronization fence from one timeline to another.


So as far as I can see the new point should be part of the timeline 
of the syncobj we are transferring to.


If the application wants to not depend on previous points for wait 
operations, it can reset the syncobj prior to adding a new point.


Well we should never lose synchronization. So what happens is that 
when we do the transfer all the fences of the source are flattened 
out into an array. And that array is then added as new point into the 
destination timeline.



In this case would be broken :


syncobjA <- signal point 1

syncobjA <- import syncobjB point 1 into syncobjA point 2

syncobjA <- query returns 0


-Lionel



Err... Let's double check with my colleagues.

It seems we're running into a test failure in IGT with this patch, but 
now I have doubts that it's where the problem lies.



-Lionel







Where exactly is the problem?

Regards,
Christian.





Cheers,


-Lionel




+   1, false);
+    if (!array)
+    goto free_fences;
+
+    dma_fence_put(*f);
+    *f = >base;
+    return 0;
+
+free_fences:
+    while (count--)
+    dma_fence_put(fences[count]);
+
+    kfree(fences);
+    return -ENOMEM;
+}
+
  static int drm_syncobj_transfer_to_timeline(struct drm_file 
*file_private,

  struct drm_syncobj_transfer *args)
  {
  struct drm_syncobj *timeline_syncobj = NULL;
-    struct dma_fence *fence;
  struct dma_fence_chain *chain;
+    struct dma_fence *fence;
  int ret;
    timeline_syncobj = drm_syncobj_find(file_private, 
args->dst_handle);
@@ -869,16 +914,22 @@ static int 
drm_syncobj_transfer_to_timeline(struct drm_file *file_private,

   args->src_point, args->flags,
   );
  if (ret)
-    goto err;
+    goto err_put_timeline;
+
+    ret = drm_syncobj_flatten_chain();
+    if (ret)
+    goto err_free_fence;
+
  chain = dma_fence_chain_alloc();
  if (!chain) {
  ret = -ENOMEM;
-    goto err1;
+    goto err_free_fence;
  }
+
  drm_syncobj_add_point(timeline_syncobj, chain, fence, 
args->dst_point);

-err1:
+err_free_fence:
  dma_fence_put(fence);
-err:
+err_put_timeline:
  drm_syncobj_put(timeline_syncobj);
    return ret;

Re: [Intel-gfx] [PATCH] drm/syncobj: flatten dma_fence_chains on transfer

2022-05-25 Thread Lionel Landwerlin


On 25/05/2022 11:24, Christian König wrote:

Am 25.05.22 um 08:47 schrieb Lionel Landwerlin:

On 09/02/2022 20:26, Christian König wrote:

It is illegal to add a dma_fence_chain as timeline point. Flatten out
the fences into a dma_fence_array instead.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/drm_syncobj.c | 61 
---

  1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c 
b/drivers/gpu/drm/drm_syncobj.c

index c313a5b4549c..7e48dcd1bee4 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -853,12 +853,57 @@ drm_syncobj_fd_to_handle_ioctl(struct 
drm_device *dev, void *data,

  >handle);
  }
  +
+/*
+ * Try to flatten a dma_fence_chain into a dma_fence_array so that 
it can be

+ * added as timeline fence to a chain again.
+ */
+static int drm_syncobj_flatten_chain(struct dma_fence **f)
+{
+    struct dma_fence_chain *chain = to_dma_fence_chain(*f);
+    struct dma_fence *tmp, **fences;
+    struct dma_fence_array *array;
+    unsigned int count;
+
+    if (!chain)
+    return 0;
+
+    count = 0;
+    dma_fence_chain_for_each(tmp, >base)
+    ++count;
+
+    fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
+    if (!fences)
+    return -ENOMEM;
+
+    count = 0;
+    dma_fence_chain_for_each(tmp, >base)
+    fences[count++] = dma_fence_get(tmp);
+
+    array = dma_fence_array_create(count, fences,
+   dma_fence_context_alloc(1),



Hi Christian,


Sorry for the late answer to this.


It appears this commit is trying to remove the warnings added by 
"dma-buf: Warn about dma_fence_chain container rules"


Yes, correct. We are now enforcing some rules with warnings and this 
here bubbled up.




But the context allocation you added just above is breaking some 
tests. In particular igt@syncobj_timeline@transfer-timeline-point


That test transfer points into the timeline at point 3 and expects 
that we'll still on the previous points to complete.


Hui what? I don't understand the problem you are seeing here. What 
exactly is the test doing?




In my opinion we should be reusing the previous context number if 
there is one and only allocate if we don't have a point.


Scratching my head what you mean with that. The functionality 
transfers a synchronization fence from one timeline to another.


So as far as I can see the new point should be part of the timeline of 
the syncobj we are transferring to.


If the application wants to not depend on previous points for wait 
operations, it can reset the syncobj prior to adding a new point.


Well we should never lose synchronization. So what happens is that 
when we do the transfer all the fences of the source are flattened out 
into an array. And that array is then added as new point into the 
destination timeline.



In this case would be broken :


syncobjA <- signal point 1

syncobjA <- import syncobjB point 1 into syncobjA point 2

syncobjA <- query returns 0


-Lionel




Where exactly is the problem?

Regards,
Christian.





Cheers,


-Lionel




+   1, false);
+    if (!array)
+    goto free_fences;
+
+    dma_fence_put(*f);
+    *f = >base;
+    return 0;
+
+free_fences:
+    while (count--)
+    dma_fence_put(fences[count]);
+
+    kfree(fences);
+    return -ENOMEM;
+}
+
  static int drm_syncobj_transfer_to_timeline(struct drm_file 
*file_private,

  struct drm_syncobj_transfer *args)
  {
  struct drm_syncobj *timeline_syncobj = NULL;
-    struct dma_fence *fence;
  struct dma_fence_chain *chain;
+    struct dma_fence *fence;
  int ret;
    timeline_syncobj = drm_syncobj_find(file_private, 
args->dst_handle);
@@ -869,16 +914,22 @@ static int 
drm_syncobj_transfer_to_timeline(struct drm_file *file_private,

   args->src_point, args->flags,
   );
  if (ret)
-    goto err;
+    goto err_put_timeline;
+
+    ret = drm_syncobj_flatten_chain();
+    if (ret)
+    goto err_free_fence;
+
  chain = dma_fence_chain_alloc();
  if (!chain) {
  ret = -ENOMEM;
-    goto err1;
+    goto err_free_fence;
  }
+
  drm_syncobj_add_point(timeline_syncobj, chain, fence, 
args->dst_point);

-err1:
+err_free_fence:
  dma_fence_put(fence);
-err:
+err_put_timeline:
  drm_syncobj_put(timeline_syncobj);
    return ret;

Re: [Intel-gfx] [PATCH] drm/syncobj: flatten dma_fence_chains on transfer

2022-05-25 Thread Lionel Landwerlin


On 09/02/2022 20:26, Christian König wrote:

It is illegal to add a dma_fence_chain as timeline point. Flatten out
the fences into a dma_fence_array instead.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/drm_syncobj.c | 61 ---
  1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index c313a5b4549c..7e48dcd1bee4 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -853,12 +853,57 @@ drm_syncobj_fd_to_handle_ioctl(struct drm_device *dev, 
void *data,
>handle);
  }
  
+

+/*
+ * Try to flatten a dma_fence_chain into a dma_fence_array so that it can be
+ * added as timeline fence to a chain again.
+ */
+static int drm_syncobj_flatten_chain(struct dma_fence **f)
+{
+   struct dma_fence_chain *chain = to_dma_fence_chain(*f);
+   struct dma_fence *tmp, **fences;
+   struct dma_fence_array *array;
+   unsigned int count;
+
+   if (!chain)
+   return 0;
+
+   count = 0;
+   dma_fence_chain_for_each(tmp, >base)
+   ++count;
+
+   fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
+   if (!fences)
+   return -ENOMEM;
+
+   count = 0;
+   dma_fence_chain_for_each(tmp, >base)
+   fences[count++] = dma_fence_get(tmp);
+
+   array = dma_fence_array_create(count, fences,
+  dma_fence_context_alloc(1),



Hi Christian,


Sorry for the late answer to this.


It appears this commit is trying to remove the warnings added by 
"dma-buf: Warn about dma_fence_chain container rules"


But the context allocation you added just above is breaking some tests. 
In particular igt@syncobj_timeline@transfer-timeline-point


That test transfer points into the timeline at point 3 and expects that 
we'll still on the previous points to complete.



In my opinion we should be reusing the previous context number if there 
is one and only allocate if we don't have a point.


If the application wants to not depend on previous points for wait 
operations, it can reset the syncobj prior to adding a new point.



Cheers,


-Lionel




+  1, false);
+   if (!array)
+   goto free_fences;
+
+   dma_fence_put(*f);
+   *f = >base;
+   return 0;
+
+free_fences:
+   while (count--)
+   dma_fence_put(fences[count]);
+
+   kfree(fences);
+   return -ENOMEM;
+}
+
  static int drm_syncobj_transfer_to_timeline(struct drm_file *file_private,
struct drm_syncobj_transfer *args)
  {
struct drm_syncobj *timeline_syncobj = NULL;
-   struct dma_fence *fence;
struct dma_fence_chain *chain;
+   struct dma_fence *fence;
int ret;
  
  	timeline_syncobj = drm_syncobj_find(file_private, args->dst_handle);

@@ -869,16 +914,22 @@ static int drm_syncobj_transfer_to_timeline(struct 
drm_file *file_private,
 args->src_point, args->flags,
 );
if (ret)
-   goto err;
+   goto err_put_timeline;
+
+   ret = drm_syncobj_flatten_chain();
+   if (ret)
+   goto err_free_fence;
+
chain = dma_fence_chain_alloc();
if (!chain) {
ret = -ENOMEM;
-   goto err1;
+   goto err_free_fence;
}
+
drm_syncobj_add_point(timeline_syncobj, chain, fence, args->dst_point);
-err1:
+err_free_fence:
dma_fence_put(fence);
-err:
+err_put_timeline:
drm_syncobj_put(timeline_syncobj);
  
  	return ret;

Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

2022-05-24 Thread Lionel Landwerlin


On 20/05/2022 01:52, Zanoni, Paulo R wrote:

On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:

VM_BIND design document with description of intended use cases.

v2: Add more documentation and format as per review comments
 from Daniel.

Signed-off-by: Niranjana Vishwanathapura 
---

diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst 
b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index ..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==
+I915 VM_BIND feature design and use cases
+==
+
+VM_BIND feature
+
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BOs at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without user having to provide a list of all required
+mappings during each submission (as required by older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
+User has to opt-in for VM_BIND mode of binding for an address space (VM)
+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
+async worker. The binding and unbinding will work like a special GPU engine.
+The binding and unbinding operations are serialized and will wait on specified
+input fences before the operation and will signal the output fences upon the
+completion of the operation. Due to serialization, completion of an operation
+will also indicate that all previous operations are also complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+---
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support older execbuff mode of
+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence,
+no support for implicit sync. It is expected that the below work will be able
+to support requirements of object dependency setting in all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)

I would really like to have more details here. The link provided points
to new ioctls and we're not very familiar with those yet, so I think
you should really clarify the interaction between the new additions
here. Having some sample code would be really nice too.

For Mesa at least (and I believe for the other drivers too) we always
have a few exported buffers in every execbuf call, and we rely on the
implicit synchronization provided by execbuf to make sure everything
works. The execbuf ioctl also has some code to flush caches during
implicit synchronization AFAIR, so I would guess we rely on it too and
whatever else the Kernel does. Is that covered by the new ioctls?

In addition, as far as I remember, one of the big improvements of
vm_bind was that it would help reduce ioctl latency and cpu overhead.
But if making execbuf faster comes at the cost of requiring additional
ioctls calls for implicit synchronization, which is required on ever
execbuf call, then I wonder if we'll even get any faster at all.
Comparing old execbuf vs plain new execbuf without the new required
ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around
every single execbuf ioctl we submit? Again, more clarification and
some code examples here would be really nice. This is a big change on
an important part of the API, we should clarify the new expected usage.



Hey Paulo,


I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls 
per frame which seems pretty reasonable.


Essentially we need to set the dependencies on the buffer we´re going to 
tell the display engine (gnome-shell/kde/bare-display-hw) to use.



In the Vulkan case, we're

Re: [Intel-gfx] [PATCH v3] drm/doc: add rfc section for small BAR uapi

2022-05-17 Thread Lionel Landwerlin


On 17/05/2022 12:23, Tvrtko Ursulin wrote:


On 17/05/2022 09:55, Lionel Landwerlin wrote:

On 17/05/2022 11:29, Tvrtko Ursulin wrote:


On 16/05/2022 19:11, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)
v3:
   - Drop the vma query for now.
   - Add unallocated_cpu_visible_size as part of the region query.
   - Improve the docs some more, including documenting the expected
 behaviour on older kernels, since this came up in some offline
 discussion.

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Tvrtko Ursulin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jon Bloomfield 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 164 
+++

  Documentation/gpu/rfc/i915_small_bar.rst |  47 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 215 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..4079d287750b
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,164 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;


Is -1 possible today or when it will be? For system memory it 
appears zeroes are returned today so that has to stay I think. Does 
it effectively mean userspace has to consider both 0 and -1 as 
unknown is the question.



I raised this on v2. As far as I can tell there are no situation 
where we would get -1.


Is it really probed_size=0 on smem?? It's not the case on the 
internal branch.


My bad, I misread the arguments to intel_memory_region_create while 
grepping:


struct intel_memory_region *i915_gem_shmem_setup(struct 
drm_i915_private *i915,

 u16 type, u16 instance)
{
return intel_memory_region_create(i915, 0,
  totalram_pages() << PAGE_SHIFT,
  PAGE_SIZE, 0, 0,
  type, instance,
  _region_ops);

I saw "0, 0" and wrongly assumed that would be the data, since it 
matched with my mental model and the comment against unallocated_size 
saying it's only tracked for device memory.


Although I'd say it is questionable for i915 to return this data. I 
wonder it use case is possible where it would even be wrong but don't 
know. I guess the cat is out of the bag now.



Not sure how questionable that is. There are a bunch of tools reporting 
the amount of memory available (free, top, htop, etc...).


It might not be totalram_pages() but probably something close to it.

Having a non 0 & non -1 value is useful.


-Lionel




If the situation is -1 for unknown and some valid size (not zero) I 
don't think there is a problem here.


Regards,

Tvrtko


Anv is not currently handling that case.


I would very much like to not deal with 0 for smem.

It really makes it easier for userspace rather than having to fish 
information from 2 different places and on top of dealing with 
multiple kernel versions.



-Lionel





+
+    /**
+ * @unallocated_size: Estimate of memory remaining (-1 = unknown)
+ *
+ * Note this is only currently tracked for 
I915_MEMORY_CLASS_DEVICE

+ * regions, and also requires CAP_PERFMON or CAP_SYS_ADMIN to get
+ * reliable accounting. Without this(or if this an older 
kernel) the


s/if this an/if this is an/

Also same question as above about -1.


+ * value here will always match the @probed_size.
+ */
+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).


Also question about -1. In this case this could be done since the 
field is yet to be added but I am curious if it ever can be -1.



+ *
+ * This will be always be <= @probed_size, and the
+

Re: [Intel-gfx] [PATCH v3] drm/doc: add rfc section for small BAR uapi

2022-05-17 Thread Lionel Landwerlin


On 17/05/2022 11:29, Tvrtko Ursulin wrote:


On 16/05/2022 19:11, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)
v3:
   - Drop the vma query for now.
   - Add unallocated_cpu_visible_size as part of the region query.
   - Improve the docs some more, including documenting the expected
 behaviour on older kernels, since this came up in some offline
 discussion.

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Tvrtko Ursulin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jon Bloomfield 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 164 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  47 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 215 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..4079d287750b
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,164 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;


Is -1 possible today or when it will be? For system memory it appears 
zeroes are returned today so that has to stay I think. Does it 
effectively mean userspace has to consider both 0 and -1 as unknown is 
the question.



I raised this on v2. As far as I can tell there are no situation where 
we would get -1.


Is it really probed_size=0 on smem?? It's not the case on the internal 
branch.


Anv is not currently handling that case.


I would very much like to not deal with 0 for smem.

It really makes it easier for userspace rather than having to fish 
information from 2 different places and on top of dealing with multiple 
kernel versions.



-Lionel





+
+    /**
+ * @unallocated_size: Estimate of memory remaining (-1 = unknown)
+ *
+ * Note this is only currently tracked for I915_MEMORY_CLASS_DEVICE
+ * regions, and also requires CAP_PERFMON or CAP_SYS_ADMIN to get
+ * reliable accounting. Without this(or if this an older kernel) 
the


s/if this an/if this is an/

Also same question as above about -1.


+ * value here will always match the @probed_size.
+ */
+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).


Also question about -1. In this case this could be done since the 
field is yet to be added but I am curious if it ever can be -1.



+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ *
+ * On systems without small BAR, the @probed_size will
+ * always equal the @probed_cpu_visible_size, since all
+ * of it will be CPU accessible.
+ *
+ * Note that if the value returned here is zero, then
+ * this must be an old kernel which lacks the relevant
+ * small-bar uAPI support(including


I have noticed you prefer no space before parentheses throughout the 
text so I guess it's just my preference to have it. Very nitpicky even 
if I am right so up to you.



+ * I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS), but on
+ * such systems we should never actually end up with a
+ * small BAR configuration, assuming we are able to load
+ * the kernel module. Hence it should be safe to treat
+ * this the same as when @probed_cpu_visible_size ==
+ * @probed_size.
+ */
+    __u64 probed_cpu_visible_size;
+
+    /**
+ * @unallocated_cpu_visible_size: Estimate of CPU
+ * visible memory remaining (-1 = unknown).
+ *
+ * Note this is only currently

Re: [Intel-gfx] [PATCH v3] uapi/drm/i915: Document memory residency and Flat-CCS capability of obj

2022-05-16 Thread Lionel Landwerlin


On 14/05/2022 00:06, Jordan Justen wrote:

On 2022-05-13 05:31:00, Lionel Landwerlin wrote:

On 02/05/2022 17:15, Ramalingam C wrote:

Capture the impact of memory region preference list of the objects, on
their memory residency and Flat-CCS capability.

v2:
Fix the Flat-CCS capability of an obj with {lmem, smem} preference
list [Thomas]
v3:
Reworded the doc [Matt]

Signed-off-by: Ramalingam C
cc: Matthew Auld
cc: Thomas Hellstrom
cc: Daniel Vetter
cc: Jon Bloomfield
cc: Lionel Landwerlin
cc: Kenneth Graunke
cc:mesa-...@lists.freedesktop.org
cc: Jordan Justen
cc: Tony Ye
Reviewed-by: Matthew Auld
---
   include/uapi/drm/i915_drm.h | 16 
   1 file changed, 16 insertions(+)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a2def7b27009..b7e1c2fe08dc 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -3443,6 +3443,22 @@ struct drm_i915_gem_create_ext {
* At which point we get the object handle in 
_i915_gem_create_ext.handle,
* along with the final object size in _i915_gem_create_ext.size, which
* should account for any rounding up, if required.
+ *
+ * Note that userspace has no means of knowing the current backing region
+ * for objects where @num_regions is larger than one. The kernel will only
+ * ensure that the priority order of the @regions array is honoured, either
+ * when initially placing the object, or when moving memory around due to
+ * memory pressure
+ *
+ * On Flat-CCS capable HW, compression is supported for the objects residing
+ * in I915_MEMORY_CLASS_DEVICE. When such objects (compressed) has other
+ * memory class in @regions and migrated (by I915, due to memory
+ * constrain) to the non I915_MEMORY_CLASS_DEVICE region, then I915 needs to
+ * decompress the content. But I915 dosen't have the required information to
+ * decompress the userspace compressed objects.
+ *
+ * So I915 supports Flat-CCS, only on the objects which can reside only on
+ * I915_MEMORY_CLASS_DEVICE regions.

I think it's fine to assume Flat-CSS surface will always be in lmem.

I see no issue for the Anv Vulkan driver.

Maybe Nanley or Ken can speak for the Iris GL driver?


Acked-by: Jordan Justen

I think Nanley has accounted for this on iris with:

https://gitlab.freedesktop.org/mesa/mesa/-/commit/42a865730ef72574e179b56a314f30fdccc6cba8

-Jordan


Thanks Jordan,


We might want to through in an additional : assert((|flags 
&||BO_ALLOC_SMEM) == 0); in the CCS case

|

|
|

|-Lionel
|

Re: [Intel-gfx] [PATCH v3] uapi/drm/i915: Document memory residency and Flat-CCS capability of obj

2022-05-13 Thread Lionel Landwerlin


On 02/05/2022 17:15, Ramalingam C wrote:

Capture the impact of memory region preference list of the objects, on
their memory residency and Flat-CCS capability.

v2:
   Fix the Flat-CCS capability of an obj with {lmem, smem} preference
   list [Thomas]
v3:
   Reworded the doc [Matt]

Signed-off-by: Ramalingam C 
cc: Matthew Auld 
cc: Thomas Hellstrom 
cc: Daniel Vetter 
cc: Jon Bloomfield 
cc: Lionel Landwerlin 
cc: Kenneth Graunke 
cc: mesa-...@lists.freedesktop.org
cc: Jordan Justen 
cc: Tony Ye 
Reviewed-by: Matthew Auld 
---
  include/uapi/drm/i915_drm.h | 16 
  1 file changed, 16 insertions(+)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a2def7b27009..b7e1c2fe08dc 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -3443,6 +3443,22 @@ struct drm_i915_gem_create_ext {
   * At which point we get the object handle in _i915_gem_create_ext.handle,
   * along with the final object size in _i915_gem_create_ext.size, which
   * should account for any rounding up, if required.
+ *
+ * Note that userspace has no means of knowing the current backing region
+ * for objects where @num_regions is larger than one. The kernel will only
+ * ensure that the priority order of the @regions array is honoured, either
+ * when initially placing the object, or when moving memory around due to
+ * memory pressure
+ *
+ * On Flat-CCS capable HW, compression is supported for the objects residing
+ * in I915_MEMORY_CLASS_DEVICE. When such objects (compressed) has other
+ * memory class in @regions and migrated (by I915, due to memory
+ * constrain) to the non I915_MEMORY_CLASS_DEVICE region, then I915 needs to
+ * decompress the content. But I915 dosen't have the required information to
+ * decompress the userspace compressed objects.
+ *
+ * So I915 supports Flat-CCS, only on the objects which can reside only on
+ * I915_MEMORY_CLASS_DEVICE regions.



I think it's fine to assume Flat-CSS surface will always be in lmem.

I see no issue for the Anv Vulkan driver.


Maybe Nanley or Ken can speak for the Iris GL driver?


-Lionel



   */
  struct drm_i915_gem_create_ext_memory_regions {
/** @base: Extension link. See struct i915_user_extension. */

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-03 Thread Lionel Landwerlin


On 03/05/2022 17:27, Matthew Auld wrote:

On 03/05/2022 11:39, Lionel Landwerlin wrote:

On 03/05/2022 13:22, Matthew Auld wrote:

On 02/05/2022 09:53, Lionel Landwerlin wrote:

On 02/05/2022 10:54, Lionel Landwerlin wrote:

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 
+++

  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region 
as known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown) */

+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the 
driver

+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };



Trying to implement userspace support in Vulkan for this, I have 
an additional question about the value of probed_cpu_visible_size.


When is it set to -1?

I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



Other pain point of this new uAPI, previously we could query the 
unallocated size for each heap.


unallocated_size should always give the same value as probed_size. 
We have the avail tracking, but we don't currently expose that 
through unallocated_size, due to lack of real userspace/user etc.




Now lmem is effectively divided into 2 heaps, but unallocated_size 
is tracking allocation from both parts of lmem.


Yeah, if we ever properly expose the unallocated_size, then we could 
also just add unallocated_cpu_visible_size.




Is adding new I915_MEMORY_CLASS_DEVICE_NON_MAPPABLE out of question?


I don't think it's out of the question...

I guess user-space should be able to get the current flag behaviour 
just by specifying: device, system. And it does give more flexibly 
to allow something like: device, device-nm, smem.


We can also drop the probed_cpu_visible_size, which would now just 
be the probed_size with device/device-nm. And if we lack device-nm, 
then the entire thing must be CPU mappable.


One of the downsides though, is that we can no longer easily mix 
object pages from both device + device-nm, which we could previously 
do when we didn't specify the flag. At least according to the 
current design/behaviour for @regions that would not be allowed. I 
guess some kind of new flag like ALLOC_MIXED or so? Although 
currently that is only possible with device + device-nm in ttm/i915.



Thanks, I wasn't aware of the restrictions.

Adding unallocated_cpu_visible_size would be great.


So do we want this in the next version? i.e we already have a current 
real use case in mind for unallocated_size where probed_size is not 
good enough?



Yeah in the  next iteration.

We're using unallocated_size to implement VK_EXT_memory_budget and since 
I'm going to expose lmem mappable/unmappable as 2 different heaps on 
Vulkan, I would use that there too.



-Lionel







-Lionel







-Lionel






+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create 
behaviour, with added

+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should b

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-03 Thread Lionel Landwerlin


On 03/05/2022 13:22, Matthew Auld wrote:

On 02/05/2022 09:53, Lionel Landwerlin wrote:

On 02/05/2022 10:54, Lionel Landwerlin wrote:

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 
+++

  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown) */

+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };



Trying to implement userspace support in Vulkan for this, I have an 
additional question about the value of probed_cpu_visible_size.


When is it set to -1?

I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



Other pain point of this new uAPI, previously we could query the 
unallocated size for each heap.


unallocated_size should always give the same value as probed_size. We 
have the avail tracking, but we don't currently expose that through 
unallocated_size, due to lack of real userspace/user etc.




Now lmem is effectively divided into 2 heaps, but unallocated_size is 
tracking allocation from both parts of lmem.


Yeah, if we ever properly expose the unallocated_size, then we could 
also just add unallocated_cpu_visible_size.




Is adding new I915_MEMORY_CLASS_DEVICE_NON_MAPPABLE out of question?


I don't think it's out of the question...

I guess user-space should be able to get the current flag behaviour 
just by specifying: device, system. And it does give more flexibly to 
allow something like: device, device-nm, smem.


We can also drop the probed_cpu_visible_size, which would now just be 
the probed_size with device/device-nm. And if we lack device-nm, then 
the entire thing must be CPU mappable.


One of the downsides though, is that we can no longer easily mix 
object pages from both device + device-nm, which we could previously 
do when we didn't specify the flag. At least according to the current 
design/behaviour for @regions that would not be allowed. I guess some 
kind of new flag like ALLOC_MIXED or so? Although currently that is 
only possible with device + device-nm in ttm/i915.



Thanks, I wasn't aware of the restrictions.

Adding unallocated_cpu_visible_size would be great.


-Lionel







-Lionel






+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create 
behaviour, with added

+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for 
the stuff that
+ * is immutable. Previously we would have two ioctls, one to 
create the object
+ * with gem_create, and another to apply various parameters, 
however this
+ * creates some ambiguity for the params which are considered 
immutable. Also in

+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the object.
+ *
+ * The (page-aligned) al

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-03 Thread Lionel Landwerlin


On 03/05/2022 12:07, Matthew Auld wrote:

On 02/05/2022 19:03, Lionel Landwerlin wrote:

On 02/05/2022 20:58, Abodunrin, Akeem G wrote:



-Original Message-
From: Landwerlin, Lionel G 
Sent: Monday, May 2, 2022 12:55 AM
To: Auld, Matthew ; 
intel-gfx@lists.freedesktop.org

Cc: dri-de...@lists.freedesktop.org; Thomas Hellström
; Bloomfield, Jon
; Daniel Vetter ; 
Justen,

Jordan L ; Kenneth Graunke
; Abodunrin, Akeem G
; mesa-...@lists.freedesktop.org
Subject: Re: [PATCH v2] drm/doc: add rfc section for small BAR uapi

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
    - Some spelling fixes and other small tweaks. (Akeem & Thomas)
    - Rework error capture interactions, including no longer needing
  NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
    - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
   Documentation/gpu/rfc/i915_small_bar.h   | 190

+++

Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
   Documentation/gpu/rfc/index.rst  |   4 +
   3 files changed, 252 insertions(+)
   create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
   create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as
+known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct

drm_i915_query.

+ * For this new query we are adding the new query id
+DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @rsvd0: MBZ */
+   __u32 rsvd0;
+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown)

*/

+   __u64 unallocated_size;
+
+   union {
+   /** @rsvd1: MBZ */
+   __u64 rsvd1[8];
+   struct {
+   /**
+    * @probed_cpu_visible_size: Memory probed by the

driver

+    * that is CPU accessible. (-1 = unknown).
+    *
+    * This will be always be <= @probed_size, and 
the

+    * remainder(if there is any) will not be CPU
+    * accessible.
+    */
+   __u64 probed_cpu_visible_size;
+   };


Trying to implement userspace support in Vulkan for this, I have an 
additional

question about the value of probed_cpu_visible_size.

When is it set to -1?
I believe it is set to -1 if it is unknown, and/or not cpu 
accessible...


Cheers!
~Akeem



So what should I expect on system memory?


I guess just probed_cpu_visible_size == probed_size. Or maybe we can 
just use -1 here?




What value is returned when all of probed_size is CPU visible on 
local memory?


probed_size == probed_cpu_visible_size.



Thanks, looks good to me.

Then maybe we should update the comment to say that.

Looks like there are no cases where we'll get -1.


-Lionel







Thanks,


-Lionel



I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



+   };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour,
+with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the
+stuff that
+ * is immutable. Previously we would have two ioctls, one to create
+the object
+ * with gem_create, and another to apply various parameters, however
+this
+ * creates some ambiguity for the params which are considered
+immutable. Also in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+   /**
+    * @size: Requested size for the object.
+    *
+    * The (page-aligned) allocated size for the object will be 
returned.

+    *
+    * Note that for some devices we have might have further minimum
+    * page-size restrictions(larger than 4K), like for device 
local-memory.
+    * However in general the final size here should always 
reflect any

+    * rounding up, if for example using the

I915_GEM_CREATE_EXT_MEMORY_REGIONS

+    * extension to place the object in device local-memory.
+    */
+   __u64 size;
+   /**
+    * @handle: Returned handle for the object.
+    *
+    * Object handles are nonzero.
+    */
+   __u32

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-02 Thread Lionel Landwerlin


On 02/05/2022 20:58, Abodunrin, Akeem G wrote:



-Original Message-
From: Landwerlin, Lionel G 
Sent: Monday, May 2, 2022 12:55 AM
To: Auld, Matthew ; intel-gfx@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org; Thomas Hellström
; Bloomfield, Jon
; Daniel Vetter ; Justen,
Jordan L ; Kenneth Graunke
; Abodunrin, Akeem G
; mesa-...@lists.freedesktop.org
Subject: Re: [PATCH v2] drm/doc: add rfc section for small BAR uapi

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
- Some spelling fixes and other small tweaks. (Akeem & Thomas)
- Rework error capture interactions, including no longer needing
  NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
- Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
   Documentation/gpu/rfc/i915_small_bar.h   | 190

+++

   Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
   Documentation/gpu/rfc/index.rst  |   4 +
   3 files changed, 252 insertions(+)
   create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
   create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as
+known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct

drm_i915_query.

+ * For this new query we are adding the new query id
+DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @rsvd0: MBZ */
+   __u32 rsvd0;
+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = unknown)

*/

+   __u64 unallocated_size;
+
+   union {
+   /** @rsvd1: MBZ */
+   __u64 rsvd1[8];
+   struct {
+   /**
+* @probed_cpu_visible_size: Memory probed by the

driver

+* that is CPU accessible. (-1 = unknown).
+*
+* This will be always be <= @probed_size, and the
+* remainder(if there is any) will not be CPU
+* accessible.
+*/
+   __u64 probed_cpu_visible_size;
+   };


Trying to implement userspace support in Vulkan for this, I have an additional
question about the value of probed_cpu_visible_size.

When is it set to -1?

I believe it is set to -1 if it is unknown, and/or not cpu accessible...

Cheers!
~Akeem



So what should I expect on system memory?

What value is returned when all of probed_size is CPU visible on local 
memory?



Thanks,


-Lionel



I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



+   };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour,
+with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the
+stuff that
+ * is immutable. Previously we would have two ioctls, one to create
+the object
+ * with gem_create, and another to apply various parameters, however
+this
+ * creates some ambiguity for the params which are considered
+immutable. Also in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+   /**
+* @size: Requested size for the object.
+*
+* The (page-aligned) allocated size for the object will be returned.
+*
+* Note that for some devices we have might have further minimum
+* page-size restrictions(larger than 4K), like for device local-memory.
+* However in general the final size here should always reflect any
+* rounding up, if for example using the

I915_GEM_CREATE_EXT_MEMORY_REGIONS

+* extension to place the object in device local-memory.
+*/
+   __u64 size;
+   /**
+* @handle: Returned handle for the object.
+*
+* Object handles are nonzero.
+*/
+   __u32 handle;
+   /**
+* @flags: Optional flags.
+*
+* Supported values:
+*
+* I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the

kernel that

+* the object will need to be accessed via the CPU.
+*
+* Only valid when placing objects in I915_MEMORY_CLASS_DEVICE,

and

+* only strictly required on platforms where only some of the device
+* memory is d

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-02 Thread Lionel Landwerlin


On 02/05/2022 10:54, Lionel Landwerlin wrote:

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown) */

+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };



Trying to implement userspace support in Vulkan for this, I have an 
additional question about the value of probed_cpu_visible_size.


When is it set to -1?

I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



Other pain point of this new uAPI, previously we could query the 
unallocated size for each heap.


Now lmem is effectively divided into 2 heaps, but unallocated_size is 
tracking allocation from both parts of lmem.


Is adding new I915_MEMORY_CLASS_DEVICE_NON_MAPPABLE out of question?


-Lionel






+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, 
with added

+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the 
stuff that
+ * is immutable. Previously we would have two ioctls, one to create 
the object
+ * with gem_create, and another to apply various parameters, however 
this
+ * creates some ambiguity for the params which are considered 
immutable. Also in

+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the object.
+ *
+ * The (page-aligned) allocated size for the object will be 
returned.

+ *
+ * Note that for some devices we have might have further minimum
+ * page-size restrictions(larger than 4K), like for device 
local-memory.

+ * However in general the final size here should always reflect any
+ * rounding up, if for example using the 
I915_GEM_CREATE_EXT_MEMORY_REGIONS

+ * extension to place the object in device local-memory.
+ */
+    __u64 size;
+    /**
+ * @handle: Returned handle for the object.
+ *
+ * Object handles are nonzero.
+ */
+    __u32 handle;
+    /**
+ * @flags: Optional flags.
+ *
+ * Supported values:
+ *
+ * I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the 
kernel that

+ * the object will need to be accessed via the CPU.
+ *
+ * Only valid when placing objects in I915_MEMORY_CLASS_DEVICE, and
+ * only strictly required on platforms where only some of the 
device
+ * memory is directly visible or mappable through the CPU, like 
on DG2+.

+ *
+ * One of the placements MUST also be I915_MEMORY_CLASS_SYSTEM, to
+ * ensure we can always spill the allocation to system memory, 
if we

+ * can't place the object in the mappable part of
+ * I915_MEMORY_CLASS_DEVICE.
+ *
+ * Note that since the kernel only supports f

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-05-02 Thread Lionel Landwerlin


On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @rsvd0: MBZ */
+   __u32 rsvd0;
+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+   __u64 unallocated_size;
+
+   union {
+   /** @rsvd1: MBZ */
+   __u64 rsvd1[8];
+   struct {
+   /**
+* @probed_cpu_visible_size: Memory probed by the driver
+* that is CPU accessible. (-1 = unknown).
+*
+* This will be always be <= @probed_size, and the
+* remainder(if there is any) will not be CPU
+* accessible.
+*/
+   __u64 probed_cpu_visible_size;
+   };



Trying to implement userspace support in Vulkan for this, I have an 
additional question about the value of probed_cpu_visible_size.


When is it set to -1?

I'm guessing before there is support for this value it'll be 0 (MBZ).

After after it should either be the entire lmem or something smaller.


-Lionel



+   };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the stuff that
+ * is immutable. Previously we would have two ioctls, one to create the object
+ * with gem_create, and another to apply various parameters, however this
+ * creates some ambiguity for the params which are considered immutable. Also 
in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+   /**
+* @size: Requested size for the object.
+*
+* The (page-aligned) allocated size for the object will be returned.
+*
+* Note that for some devices we have might have further minimum
+* page-size restrictions(larger than 4K), like for device local-memory.
+* However in general the final size here should always reflect any
+* rounding up, if for example using the 
I915_GEM_CREATE_EXT_MEMORY_REGIONS
+* extension to place the object in device local-memory.
+*/
+   __u64 size;
+   /**
+* @handle: Returned handle for the object.
+*
+* Object handles are nonzero.
+*/
+   __u32 handle;
+   /**
+* @flags: Optional flags.
+*
+* Supported values:
+*
+* I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the kernel that
+* the object will need to be accessed via the CPU.
+*
+* Only valid when placing objects in I915_MEMORY_CLASS_DEVICE, and
+* only strictly required on platforms where only some of the device
+* memory is directly visible or mappable through the CPU, like on DG2+.
+*
+* One of the placements MUST also be I915_MEMORY_CLASS_SYSTEM, to
+* ensure we can always spill the allocation to system memory, if we
+* can't place the object in the mappable part of
+* I915_MEMORY_CLASS_DEVICE.
+*
+* Note that since the kernel only supports flat-CCS on objects that can
+* *only* be placed in I915_MEMORY_CLASS_DEVICE, we therefor

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Lionel Landwerlin


On 27/04/2022 18:18, Matthew Auld wrote:

On 27/04/2022 07:48, Lionel Landwerlin wrote:
One question though, how do we detect that this flag 
(I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS) is accepted on a given 
kernel?
I assume older kernels are going to reject object creation if we use 
this flag?


From some offline discussion with Lionel, the plan here is to just do 
a dummy gem_create_ext to check if the kernel throws an error with the 
new flag or not.




I didn't plan to use __drm_i915_query_vma_info, but isn't it 
inconsistent to select the placement on the GEM object and then query 
whether it's mappable by address?
You made a comment stating this is racy, wouldn't querying on the GEM 
object prevent this?


Since mesa at this time doesn't currently have a use for this one, 
then I guess we should maybe just drop this part of the uapi, in this 
version at least, if no objections.



Just repeating what we discussed (maybe I missed some other discussion 
and that's why I was confused) :



The way I was planning to use this is to have 3 heaps in Vulkan :

    - heap0: local only, no cpu visible

    - heap1: system, cpu visible

    - heap2: local & cpu visible


With heap2 having the reported probed_cpu_visible_size size.

It is an error for the application to map from heap0 [1].


With that said, it means if we created a GEM BO without 
I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS, we'll never mmap it.


So why the query?

I guess it would be useful when we import a buffer from another 
application. But in that case, why not have the query on the BO?



-Lionel


[1] : 
https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/vkMapMemory.html 
(VUID-vkMapMemory-memory-00682)






Thanks,

-Lionel

On 27/04/2022 09:35, Lionel Landwerlin wrote:

Hi Matt,


The proposal looks good to me.

Looking forward to try it on drm-tip.


-Lionel

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 
+++

  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown) */

+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };
+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create 
behaviour, with added

+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for 
the stuff that
+ * is immutable. Previously we would have two ioctls, one to 
create the object
+ * with gem_create, and another to apply various parameters, 
however this
+ * creates some ambiguity for the params which are considered 
immutable. Also in

+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the object.
+ *
+ * The (page-aligned) allocated size for the object will be 
returned.

+ *
+ * Note that for some devices we have might have furt

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Lionel Landwerlin

One question though, how do we detect that this flag 
(I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS) is accepted on a given kernel?
I assume older kernels are going to reject object creation if we use 
this flag?


I didn't plan to use __drm_i915_query_vma_info, but isn't it 
inconsistent to select the placement on the GEM object and then query 
whether it's mappable by address?
You made a comment stating this is racy, wouldn't querying on the GEM 
object prevent this?


Thanks,

-Lionel

On 27/04/2022 09:35, Lionel Landwerlin wrote:

Hi Matt,


The proposal looks good to me.

Looking forward to try it on drm-tip.


-Lionel

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h

new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as 
known to the

+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS

+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+    /** @region: The class:instance pair encoding */
+    struct drm_i915_gem_memory_class_instance region;
+
+    /** @rsvd0: MBZ */
+    __u32 rsvd0;
+
+    /** @probed_size: Memory probed by the driver (-1 = unknown) */
+    __u64 probed_size;
+
+    /** @unallocated_size: Estimate of memory remaining (-1 = 
unknown) */

+    __u64 unallocated_size;
+
+    union {
+    /** @rsvd1: MBZ */
+    __u64 rsvd1[8];
+    struct {
+    /**
+ * @probed_cpu_visible_size: Memory probed by the driver
+ * that is CPU accessible. (-1 = unknown).
+ *
+ * This will be always be <= @probed_size, and the
+ * remainder(if there is any) will not be CPU
+ * accessible.
+ */
+    __u64 probed_cpu_visible_size;
+    };
+    };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, 
with added

+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the 
stuff that
+ * is immutable. Previously we would have two ioctls, one to create 
the object
+ * with gem_create, and another to apply various parameters, however 
this
+ * creates some ambiguity for the params which are considered 
immutable. Also in

+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+    /**
+ * @size: Requested size for the object.
+ *
+ * The (page-aligned) allocated size for the object will be 
returned.

+ *
+ * Note that for some devices we have might have further minimum
+ * page-size restrictions(larger than 4K), like for device 
local-memory.

+ * However in general the final size here should always reflect any
+ * rounding up, if for example using the 
I915_GEM_CREATE_EXT_MEMORY_REGIONS

+ * extension to place the object in device local-memory.
+ */
+    __u64 size;
+    /**
+ * @handle: Returned handle for the object.
+ *
+ * Object handles are nonzero.
+ */
+    __u32 handle;
+    /**
+ * @flags: Optional flags.
+ *
+ * Supported values:
+ *
+ * I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the 
kernel that

+ * the object will need to be accessed via the CPU.
+ *
+ * Only valid when placing objects in I915_MEMORY_CLASS_DEVICE, and
+ * only strictly required on platforms where only some of the 
device
+ * memory is directly visible or mappable through the CPU, like 
on DG2+.

+ *
+ * One of the placements MUST also be I915_MEMORY_CLASS_SYSTEM, to
+ * ensure we can always spill the allocation to system memory, 
if we

+ * can't place the object in the mappable part of
+ * I915_MEMORY_CLASS_DEVICE.
+ *
+ * Note that since the kernel only supports flat-CCS on objects 
that can

+

Re: [Intel-gfx] [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Lionel Landwerlin


Hi Matt,


The proposal looks good to me.

Looking forward to try it on drm-tip.


-Lionel

On 20/04/2022 20:13, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

v2:
   - Some spelling fixes and other small tweaks. (Akeem & Thomas)
   - Rework error capture interactions, including no longer needing
 NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
   - Add probed_cpu_visible_size. (Lionel)

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: Akeem G Abodunrin 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 190 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 252 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..7bfd0cf44d35
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,190 @@
+/**
+ * struct __drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS
+ * at _i915_query_item.query_id.
+ */
+struct __drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @rsvd0: MBZ */
+   __u32 rsvd0;
+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+   __u64 unallocated_size;
+
+   union {
+   /** @rsvd1: MBZ */
+   __u64 rsvd1[8];
+   struct {
+   /**
+* @probed_cpu_visible_size: Memory probed by the driver
+* that is CPU accessible. (-1 = unknown).
+*
+* This will be always be <= @probed_size, and the
+* remainder(if there is any) will not be CPU
+* accessible.
+*/
+   __u64 probed_cpu_visible_size;
+   };
+   };
+};
+
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that new buffer flags should be added here, at least for the stuff that
+ * is immutable. Previously we would have two ioctls, one to create the object
+ * with gem_create, and another to apply various parameters, however this
+ * creates some ambiguity for the params which are considered immutable. Also 
in
+ * general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+   /**
+* @size: Requested size for the object.
+*
+* The (page-aligned) allocated size for the object will be returned.
+*
+* Note that for some devices we have might have further minimum
+* page-size restrictions(larger than 4K), like for device local-memory.
+* However in general the final size here should always reflect any
+* rounding up, if for example using the 
I915_GEM_CREATE_EXT_MEMORY_REGIONS
+* extension to place the object in device local-memory.
+*/
+   __u64 size;
+   /**
+* @handle: Returned handle for the object.
+*
+* Object handles are nonzero.
+*/
+   __u32 handle;
+   /**
+* @flags: Optional flags.
+*
+* Supported values:
+*
+* I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the kernel that
+* the object will need to be accessed via the CPU.
+*
+* Only valid when placing objects in I915_MEMORY_CLASS_DEVICE, and
+* only strictly required on platforms where only some of the device
+* memory is directly visible or mappable through the CPU, like on DG2+.
+*
+* One of the placements MUST also be I915_MEMORY_CLASS_SYSTEM, to
+* ensure we can always spill the allocation to system memory, if we
+* can't place the object in the mappable part of
+* I915_MEMORY_CLASS_DEVICE.
+*
+* Note that since the kernel only supports flat-CCS on objects that can
+* *only* be placed in I915_MEMORY_CLASS_DEVICE, we therefore don't
+* support I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS together with
+* flat-CCS.
+*
+* Without this hint, the kernel will assume that non-mappable
+* I915_MEMORY_CLASS_DEVICE is pr

Re: [Intel-gfx] [PATCH 2/2] drm/doc: add rfc section for small BAR uapi

2022-03-18 Thread Lionel Landwerlin


Hey Matthew, all,

This sounds like a good thing to have.
There are a number of DG2 machines where we have a small BAR and this is 
causing more apps to fail.


Anv currently reports 3 memory heaps to the app :

    - local device only (not host visible) -> mapped to lmem
    - device/cpu -> mapped to smem
    - local device but also host visible -> mapped to lmem

So we could use this straight away, by just not putting the 
I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS flag on the allocation of the 
first heap.


One thing I don't see in this proposal is how can we get the size of the 
2 lmem heap : cpu visible, cpu not visible

We could use that to report the appropriate size to the app.
We probably want to report a new drm_i915_memory_region_info and either :
    - put one of the reserve field to use to indicate : cpu visible
    - or define a new enum value in drm_i915_gem_memory_class

Cheers,

-Lionel


On 18/02/2022 13:22, Matthew Auld wrote:

Add an entry for the new uapi needed for small BAR on DG2+.

Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Jon Bloomfield 
Cc: Daniel Vetter 
Cc: Jordan Justen 
Cc: Kenneth Graunke 
Cc: mesa-...@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_small_bar.h   | 153 +++
  Documentation/gpu/rfc/i915_small_bar.rst |  40 ++
  Documentation/gpu/rfc/index.rst  |   4 +
  3 files changed, 197 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
  create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst

diff --git a/Documentation/gpu/rfc/i915_small_bar.h 
b/Documentation/gpu/rfc/i915_small_bar.h
new file mode 100644
index ..fa65835fd608
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_small_bar.h
@@ -0,0 +1,153 @@
+/**
+ * struct __drm_i915_gem_create_ext - Existing gem_create behaviour, with added
+ * extension support using struct i915_user_extension.
+ *
+ * Note that in the future we want to have our buffer flags here, at least for
+ * the stuff that is immutable. Previously we would have two ioctls, one to
+ * create the object with gem_create, and another to apply various parameters,
+ * however this creates some ambiguity for the params which are considered
+ * immutable. Also in general we're phasing out the various SET/GET ioctls.
+ */
+struct __drm_i915_gem_create_ext {
+   /**
+* @size: Requested size for the object.
+*
+* The (page-aligned) allocated size for the object will be returned.
+*
+* Note that for some devices we have might have further minimum
+* page-size restrictions(larger than 4K), like for device local-memory.
+* However in general the final size here should always reflect any
+* rounding up, if for example using the 
I915_GEM_CREATE_EXT_MEMORY_REGIONS
+* extension to place the object in device local-memory.
+*/
+   __u64 size;
+   /**
+* @handle: Returned handle for the object.
+*
+* Object handles are nonzero.
+*/
+   __u32 handle;
+   /**
+* @flags: Optional flags.
+*
+* Supported values:
+*
+* I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS - Signal to the kernel that
+* the object will need to be accessed via the CPU.
+*
+* Only valid when placing objects in I915_MEMORY_CLASS_DEVICE, and
+* only strictly required on platforms where only some of the device
+* memory is directly visible or mappable through the CPU, like on DG2+.
+*
+* One of the placements MUST also be I915_MEMORY_CLASS_SYSTEM, to
+* ensure we can always spill the allocation to system memory, if we
+* can't place the object in the mappable part of
+* I915_MEMORY_CLASS_DEVICE.
+*
+* Note that buffers that need to be captured with EXEC_OBJECT_CAPTURE,
+* will need to enable this hint, if the object can also be placed in
+* I915_MEMORY_CLASS_DEVICE, starting from DG2+. The execbuf call will
+* throw an error otherwise. This also means that such objects will need
+* I915_MEMORY_CLASS_SYSTEM set as a possible placement.
+*
+* Without this hint, the kernel will assume that non-mappable
+* I915_MEMORY_CLASS_DEVICE is preferred for this object. Note that the
+* kernel can still migrate the object to the mappable part, as a last
+* resort, if userspace ever CPU faults this object, but this might be
+* expensive, and so ideally should be avoided.
+*/
+#define I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS (1 << 0)
+   __u32 flags;
+   /**
+* @extensions: The chain of extensions to apply to this object.
+*
+* This will be useful in the future when we need to support several
+* different extensions, and we need to apply more than one when
+* creating the object. See struct i915_user_extension.
+*
+* If we

Re: [Intel-gfx] [PATCH v4 12/16] uapi/drm/dg2: Introduce format modifier for DG2 clear color

2021-12-15 Thread Lionel Landwerlin


On 09/12/2021 17:45, Ramalingam C wrote:

From: Mika Kahola 

DG2 clear color render compression uses Tile4 layout. Therefore, we need
to define a new format modifier for uAPI to support clear color rendering.

Signed-off-by: Mika Kahola 
cc: Anshuman Gupta 
Signed-off-by: Juha-Pekka Heikkilä 
Signed-off-by: Ramalingam C 
---
  drivers/gpu/drm/i915/display/intel_fb.c| 8 
  drivers/gpu/drm/i915/display/skl_universal_plane.c | 9 -
  include/uapi/drm/drm_fourcc.h  | 8 
  3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/display/intel_fb.c 
b/drivers/gpu/drm/i915/display/intel_fb.c
index e15216f1cb82..f10e77cb5b4a 100644
--- a/drivers/gpu/drm/i915/display/intel_fb.c
+++ b/drivers/gpu/drm/i915/display/intel_fb.c
@@ -144,6 +144,12 @@ static const struct intel_modifier_desc intel_modifiers[] 
= {
.modifier = I915_FORMAT_MOD_4_TILED_DG2_MC_CCS,
.display_ver = { 13, 14 },
.plane_caps = INTEL_PLANE_CAP_TILING_4 | INTEL_PLANE_CAP_CCS_MC,
+   }, {
+   .modifier = I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC,
+   .display_ver = { 13, 14 },
+   .plane_caps = INTEL_PLANE_CAP_TILING_4 | 
INTEL_PLANE_CAP_CCS_RC_CC,
+
+   .ccs.cc_planes = BIT(1),
}, {
.modifier = I915_FORMAT_MOD_4_TILED_DG2_RC_CCS,
.display_ver = { 13, 14 },
@@ -559,6 +565,7 @@ intel_tile_width_bytes(const struct drm_framebuffer *fb, 
int color_plane)
else
return 512;
case I915_FORMAT_MOD_4_TILED_DG2_RC_CCS:
+   case I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC:
case I915_FORMAT_MOD_4_TILED_DG2_MC_CCS:
case I915_FORMAT_MOD_4_TILED:
/*
@@ -763,6 +770,7 @@ unsigned int intel_surf_alignment(const struct 
drm_framebuffer *fb,
case I915_FORMAT_MOD_Yf_TILED:
return 1 * 1024 * 1024;
case I915_FORMAT_MOD_4_TILED_DG2_RC_CCS:
+   case I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC:
case I915_FORMAT_MOD_4_TILED_DG2_MC_CCS:
return 16 * 1024;
default:
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c 
b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index d80424194c75..9a89df9c0243 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -772,6 +772,8 @@ static u32 skl_plane_ctl_tiling(u64 fb_modifier)
return PLANE_CTL_TILED_4 |
PLANE_CTL_MEDIA_DECOMPRESSION_ENABLE |
PLANE_CTL_CLEAR_COLOR_DISABLE;
+   case I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC:
+   return PLANE_CTL_TILED_4 | 
PLANE_CTL_RENDER_DECOMPRESSION_ENABLE;
case I915_FORMAT_MOD_Y_TILED_CCS:
case I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS_CC:
return PLANE_CTL_TILED_Y | 
PLANE_CTL_RENDER_DECOMPRESSION_ENABLE;
@@ -2337,10 +2339,15 @@ skl_get_initial_plane_config(struct intel_crtc *crtc,
break;
case PLANE_CTL_TILED_YF: /* aka PLANE_CTL_TILED_4 on XE_LPD+ */
if (HAS_4TILE(dev_priv)) {
-   if (val & PLANE_CTL_RENDER_DECOMPRESSION_ENABLE)
+   u32 rc_mask = PLANE_CTL_RENDER_DECOMPRESSION_ENABLE |
+ PLANE_CTL_CLEAR_COLOR_DISABLE;
+
+   if ((val & rc_mask) == rc_mask)
fb->modifier = 
I915_FORMAT_MOD_4_TILED_DG2_RC_CCS;
else if (val & PLANE_CTL_MEDIA_DECOMPRESSION_ENABLE)
fb->modifier = 
I915_FORMAT_MOD_4_TILED_DG2_MC_CCS;
+   else if (val & PLANE_CTL_RENDER_DECOMPRESSION_ENABLE)
+   fb->modifier = 
I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC;
else
fb->modifier = I915_FORMAT_MOD_4_TILED;
} else {
diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
index 51fdda26844a..b155f69f2344 100644
--- a/include/uapi/drm/drm_fourcc.h
+++ b/include/uapi/drm/drm_fourcc.h
@@ -598,6 +598,14 @@ extern "C" {
   */
  #define I915_FORMAT_MOD_4_TILED_DG2_MC_CCS fourcc_mod_code(INTEL, 11)
  


My colleague Nanley (Cc) had some requests for clarifications on this 
new modifier.


In particular in which plane is the clear color located.


I guess it wouldn't hurt to also state for each of the new modifiers 
defined in this series, how many planes and what data they contain.


Thanks,

-Lionel



+/*
+ * Intel color control surfaces (CCS) for DG2 clear color render compression.
+ *
+ * DG2 uses a unified compression format for clear color render compression.
+ * The general layout is a tiled layout using 4Kb tiles i.e. Tile4 layout.
+ */
+#define I915_FORMAT_MOD_4_TILED_DG2_RC_CCS_CC fourcc_mod_code(INTEL, 12)
+
  /*
   * Tiled, NV12MT, grouped in 64

Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/uapi: Add query for hwconfig table

2021-11-04 Thread Lionel Landwerlin


On 04/11/2021 01:49, John Harrison wrote:

On 11/3/2021 14:38, Jordan Justen wrote:

John Harrison  writes:


On 11/1/2021 08:39, Jordan Justen wrote:

 writes:


From: Rodrigo Vivi 

GuC contains a consolidated table with a bunch of information 
about the

current device.

Previously, this information was spread and hardcoded to all the 
components

including GuC, i915 and various UMDs. The goal here is to consolidate
the data into GuC in a way that all interested components can grab 
the

very latest and synchronized information using a simple query.

As per most of the other queries, this one can be called twice.
Once with item.length=0 to determine the exact buffer size, then
allocate the user memory and call it again for to retrieve the
table data. For example:
    struct drm_i915_query_item item = {
  .query_id = DRM_I915_QUERY_HWCONCFIG_TABLE;
    };
    query.items_ptr = (int64_t) 
    query.num_items = 1;

    ioctl(fd, DRM_IOCTL_I915_QUERY, query, sizeof(query));

    if (item.length <= 0)
  return -ENOENT;

    data = malloc(item.length);
    item.data_ptr = (int64_t) 
    ioctl(fd, DRM_IOCTL_I915_QUERY, query, sizeof(query));

    // Parse the data as appropriate...

The returned array is a simple and flexible KLV (Key/Length/Value)
formatted table. For example, it could be just:
    enum device_attr {
   ATTR_SOME_VALUE = 0,
   ATTR_SOME_MASK  = 1,
    };

    static const u32 hwconfig[] = {
    ATTR_SOME_VALUE,
    1, // Value Length in DWords
    8, // Value

    ATTR_SOME_MASK,
    3,
    0x00, 0x, 0xFF00,
    };

Seems simple enough, so why doesn't i915 define the format of the
returned hwconfig blob in i915_drm.h?
Because the definition is nothing to do with i915. This table comes 
from

the hardware spec. It is not defined by the KMD and it is not currently
used by the KMD. So there is no reason for the KMD to be creating
structures for it in the same way that the KMD does not document,
define, struct, etc. every other feature of the hardware that the UMDs
might use.

So, i915 wants to wash it's hands completely of the format? There is
obviously a difference between hardware features and a blob coming from
closed source software. (Which i915 just happens to be passing along.)
The hardware is a lot more difficult to change...
Actually, no. The table is not "coming from closed source software". 
The table is defined by hardware specs. It is a table of hardware 
specific values. It is not being invented by the GuC just for fun or 
as a way to subvert the universe into the realms of closed source 
software. As per KMD, GuC is merely passing the table through. The 
table is only supported on newer hardware platforms and all GuC does 
is provide a mechanism for the KMD to retrieve it because the KMD 
cannot access it directly. The table contents are defined by hardware 
architects same as all the other aspects of the hardware.




It seems like these details should be dropped from the i915 patch commit
message since i915 wants nothing to do with it.

Sure. Can remove comments.



I would think it'd be preferable for i915 to stand behind the basic blob
format as is (even if the keys/values can't be defined), and make a new
query item if the closed source software changes the format.
Close source software is not allowed to change the format because 
closed source software has no say in defining the format. The format 
is officially defined as being fixed in the spec. New key values can 
be added to the key enumeration but existing values cannot be 
deprecated and re-purposed. The table must be stable across all OSs 
and all platforms. No software can arbitrarily decide to change it.




Of course, it'd be even better if i915 could define some keys/values as
well. (Or if a spec could be released to help document / tie down the
format.)
See the corresponding IGT test that details all the currently defined 
keys.






struct drm_i915_hwconfig {
uint32_t key;
uint32_t length;
uint32_t values[];
};

It sounds like the kernel depends on the closed source guc being 
loaded

to return this information. Is that right? Will i915 also become
dependent on some of this data such that it won't be able to 
initialize

without the firmware being loaded?

At the moment, the KMD does not use the table at all. We merely provide
a mechanism for the UMDs to retrieve it from the hardware.

In terms of future direction, that is something you need to take up 
with

the hardware architects.


Why do you keep saying hardware, when only software is involved?
See above - because the table is defined by hardware. No software, 
closed or open, has any say in the specification of the table.



The values in the table might be defined by hardware, but the table 
itself definitely isn't.


Like Jordan tried to explain, because this is a software interface, it 
changes more often than HW.


Now testing doesn't just involve the

Re: [Intel-gfx] [PATCH] DRM: i915: i915_perf: Fixed compiler warning

2021-08-10 Thread Lionel Landwerlin


On 09/08/2021 05:33, Julius Victorian wrote:

From: Julius 

Fixed compiler warning: "left shift of negative value"

Signed-off-by: Julius Victorian 


Reviewed-by: Lionel Landwerlin 


Thanks!


---
  drivers/gpu/drm/i915/i915_perf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 9f94914958c3..7b852974241e 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2804,7 +2804,7 @@ get_default_sseu_config(struct intel_sseu *out_sseu,
 * all available subslices per slice.
 */
out_sseu->subslice_mask =
-   ~(~0 << (hweight8(out_sseu->subslice_mask) / 2));
+   ~(~0U << (hweight8(out_sseu->subslice_mask) / 2));
out_sseu->slice_mask = 0x1;
}
  }

Re: [Intel-gfx] [PATCH 31/53] drm/i915/dg2: Report INSTDONE_GEOM values in error state

2021-07-02 Thread Lionel Landwerlin


On 01/07/2021 23:24, Matt Roper wrote:

Xe_HPG adds some additional INSTDONE_GEOM debug registers; the Mesa team
has indicated that having these reported in the error state would be
useful for debugging GPU hangs.  These registers are replicated per-DSS
with gslice steering.

Cc: Lionel Landwerlin 
Signed-off-by: Matt Roper 



Thanks,


Acked-by: Lionel Landwerlin 



---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c|  7 +++
  drivers/gpu/drm/i915/gt/intel_engine_types.h |  3 +++
  drivers/gpu/drm/i915/i915_gpu_error.c| 10 --
  drivers/gpu/drm/i915/i915_reg.h  |  1 +
  4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index e1302e9c168b..b3c002e4ae9f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1220,6 +1220,13 @@ void intel_engine_get_instdone(const struct 
intel_engine_cs *engine,
  GEN7_ROW_INSTDONE);
}
}
+
+   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 55)) {
+   for_each_instdone_gslice_dss_xehp(i915, sseu, iter, 
slice, subslice)
+   instdone->geom_svg[slice][subslice] =
+   read_subslice_reg(engine, slice, 
subslice,
+ 
XEHPG_INSTDONE_GEOM_SVG);
+   }
} else if (GRAPHICS_VER(i915) >= 7) {
instdone->instdone =
intel_uncore_read(uncore, RING_INSTDONE(mmio_base));
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index e917b7519f2b..93609d797ac2 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -80,6 +80,9 @@ struct intel_instdone {
u32 slice_common_extra[2];
u32 sampler[GEN_MAX_GSLICES][I915_MAX_SUBSLICES];
u32 row[GEN_MAX_GSLICES][I915_MAX_SUBSLICES];
+
+   /* Added in XeHPG */
+   u32 geom_svg[GEN_MAX_GSLICES][I915_MAX_SUBSLICES];
  };
  
  /*

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
b/drivers/gpu/drm/i915/i915_gpu_error.c
index c1e744b5ab47..4de7edc451ef 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -431,6 +431,7 @@ static void error_print_instdone(struct 
drm_i915_error_state_buf *m,
const struct sseu_dev_info *sseu = >engine->gt->info.sseu;
int slice;
int subslice;
+   int iter;
  
  	err_printf(m, "  INSTDONE: 0x%08x\n",

   ee->instdone.instdone);
@@ -445,8 +446,6 @@ static void error_print_instdone(struct 
drm_i915_error_state_buf *m,
return;
  
  	if (GRAPHICS_VER_FULL(m->i915) >= IP_VER(12, 50)) {

-   int iter;
-
for_each_instdone_gslice_dss_xehp(m->i915, sseu, iter, slice, 
subslice)
err_printf(m, "  SAMPLER_INSTDONE[%d][%d]: 0x%08x\n",
   slice, subslice,
@@ -471,6 +470,13 @@ static void error_print_instdone(struct 
drm_i915_error_state_buf *m,
if (GRAPHICS_VER(m->i915) < 12)
return;
  
+	if (GRAPHICS_VER_FULL(m->i915) >= IP_VER(12, 55)) {

+   for_each_instdone_gslice_dss_xehp(m->i915, sseu, iter, slice, 
subslice)
+   err_printf(m, "  GEOM_SVGUNIT_INSTDONE[%d][%d]: 
0x%08x\n",
+  slice, subslice,
+  ee->instdone.geom_svg[slice][subslice]);
+   }
+
err_printf(m, "  SC_INSTDONE_EXTRA: 0x%08x\n",
   ee->instdone.slice_common_extra[0]);
err_printf(m, "  SC_INSTDONE_EXTRA2: 0x%08x\n",
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 35a42df1f2aa..d58864c7adc6 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2686,6 +2686,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
  #define GEN12_SC_INSTDONE_EXTRA2  _MMIO(0x7108)
  #define GEN7_SAMPLER_INSTDONE _MMIO(0xe160)
  #define GEN7_ROW_INSTDONE _MMIO(0xe164)
+#define XEHPG_INSTDONE_GEOM_SVG_MMIO(0x666c)
  #define MCFG_MCR_SELECTOR _MMIO(0xfd0)
  #define SF_MCR_SELECTOR   _MMIO(0xfd8)
  #define GEN8_MCR_SELECTOR _MMIO(0xfdc)



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH 3/3] drm/i915/uapi: Add query for L3 bank count

2021-06-11 Thread Lionel Landwerlin


On 10/06/2021 23:46, john.c.harri...@intel.com wrote:

From: John Harrison 

Various UMDs need to know the L3 bank count. So add a query API for it.

Signed-off-by: John Harrison 
---
  drivers/gpu/drm/i915/gt/intel_gt.c | 15 +++
  drivers/gpu/drm/i915/gt/intel_gt.h |  1 +
  drivers/gpu/drm/i915/i915_query.c  | 22 ++
  drivers/gpu/drm/i915/i915_reg.h|  1 +
  include/uapi/drm/i915_drm.h|  1 +
  5 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 2161bf01ef8b..708bb3581d83 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -704,3 +704,18 @@ void intel_gt_info_print(const struct intel_gt_info *info,
  
  	intel_sseu_dump(>sseu, p);

  }
+
+int intel_gt_get_l3bank_count(struct intel_gt *gt)
+{
+   struct drm_i915_private *i915 = gt->i915;
+   intel_wakeref_t wakeref;
+   u32 fuse3;
+
+   if (GRAPHICS_VER(i915) < 12)
+   return -ENODEV;
+
+   with_intel_runtime_pm(gt->uncore->rpm, wakeref)
+   fuse3 = intel_uncore_read(gt->uncore, GEN10_MIRROR_FUSE3);
+
+   return hweight32(REG_FIELD_GET(GEN12_GT_L3_MODE_MASK, ~fuse3));
+}
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h 
b/drivers/gpu/drm/i915/gt/intel_gt.h
index 7ec395cace69..46aa1cf4cf30 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -77,6 +77,7 @@ static inline bool intel_gt_is_wedged(const struct intel_gt 
*gt)
  
  void intel_gt_info_print(const struct intel_gt_info *info,

 struct drm_printer *p);
+int intel_gt_get_l3bank_count(struct intel_gt *gt);
  
  void intel_gt_watchdog_work(struct work_struct *work);
  
diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c

index 96bd8fb3e895..0e92bb2d21b2 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -10,6 +10,7 @@
  #include "i915_perf.h"
  #include "i915_query.h"
  #include 
+#include "gt/intel_gt.h"
  
  static int copy_query_item(void *query_hdr, size_t query_sz,

   u32 total_length,
@@ -502,6 +503,26 @@ static int query_hwconfig_table(struct drm_i915_private 
*i915,
return hwconfig->size;
  }
  
+static int query_l3banks(struct drm_i915_private *i915,

+struct drm_i915_query_item *query_item)
+{
+   u32 banks;
+
+   if (query_item->length == 0)
+   return sizeof(banks);
+
+   if (query_item->length < sizeof(banks))
+   return -EINVAL;
+
+   banks = intel_gt_get_l3bank_count(>gt);
+
+   if (copy_to_user(u64_to_user_ptr(query_item->data_ptr),
+, sizeof(banks)))
+   return -EFAULT;
+
+   return sizeof(banks);
+}
+
  static int (* const i915_query_funcs[])(struct drm_i915_private *dev_priv,
struct drm_i915_query_item *query_item) 
= {
query_topology_info,
@@ -509,6 +530,7 @@ static int (* const i915_query_funcs[])(struct 
drm_i915_private *dev_priv,
query_perf_config,
query_memregion_info,
query_hwconfig_table,
+   query_l3banks,
  };
  
  int i915_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index eb13c601d680..e9ba88fe3db7 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -3099,6 +3099,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
  #define   GEN10_MIRROR_FUSE3  _MMIO(0x9118)
  #define GEN10_L3BANK_PAIR_COUNT 4
  #define GEN10_L3BANK_MASK   0x0F
+#define GEN12_GT_L3_MODE_MASK 0xFF
  
  #define GEN8_EU_DISABLE0		_MMIO(0x9134)

  #define   GEN8_EU_DIS0_S0_MASK0xff
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 87d369cae22a..20d18cca5066 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2234,6 +2234,7 @@ struct drm_i915_query_item {
  #define DRM_I915_QUERY_PERF_CONFIG  3
  #define DRM_I915_QUERY_MEMORY_REGIONS   4
  #define DRM_I915_QUERY_HWCONFIG_TABLE   5
+#define DRM_I915_QUERY_L3_BANK_COUNT6



A little bit of documentation about the format of the return data would 
be nice :)



-Lionel



  /* Must be kept compact -- no holes and well documented */
  
  	/**



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH i-g-t v2] lib/i915/perf: Fix non-card0 processing

2021-05-05 Thread Lionel Landwerlin


On 05/05/2021 15:41, Janusz Krzysztofik wrote:

IGT i915/perf library functions now always operate on sysfs perf
attributes of card0 device node, no matter which DRM device fd a user
passes.  The intention was to always switch to primary device node if
a user passes a render device node fd, but that breaks handling of
non-card0 devices.

If a user passed a render device node fd, find a primary device node of
the same device and use it instead of forcibly using the primary device
with minor number 0 when opening the device sysfs area.

v2: Don't assume primary minor matches render minor with masked type.

Signed-off-by: Janusz Krzysztofik 
Cc: Lionel Landwerlin 
---
  lib/i915/perf.c | 31 ---
  1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/lib/i915/perf.c b/lib/i915/perf.c
index 56d5c0b3a..d7768468e 100644
--- a/lib/i915/perf.c
+++ b/lib/i915/perf.c
@@ -372,14 +372,39 @@ open_master_sysfs_dir(int drm_fd)
  {
char path[128];
struct stat st;
+   int sysfs;
  
  	if (fstat(drm_fd, ) || !S_ISCHR(st.st_mode))

  return -1;
  
-snprintf(path, sizeof(path), "/sys/dev/char/%d:0",

- major(st.st_rdev));
+   snprintf(path, sizeof(path), "/sys/dev/char/%d:%d", major(st.st_rdev), 
minor(st.st_rdev));
+   sysfs = open(path, O_DIRECTORY);


Just to spell out the error paths :


if (sysfs < 0)

    return sysfs;


  
-	return open(path, O_DIRECTORY);

+   if (sysfs >= 0 && minor(st.st_rdev) >= 128) {



Then just if (minor(st.st_rdev) >= 128) { ...

Maybe add a comment above this is : /* If we were given a renderD* 
drm_fd, find it's associated cardX node. */




+   char device[100], cmp[100];
+   int device_len, cmp_len, i;
+
+   device_len = readlinkat(sysfs, "device", device, 
sizeof(device));
+   close(sysfs);
+   if (device_len < 0)
+   return device_len;
+
+   for (i = 0; i < 128; i++) {
+
+   snprintf(path, sizeof(path), "/sys/dev/char/%d:%d", 
major(st.st_rdev), i);
+   sysfs = open(path, O_DIRECTORY);
+   if (sysfs < 0)
+   continue;
+
+   cmp_len = readlinkat(sysfs, "device", cmp, sizeof(cmp));
+   if (cmp_len == device_len && !memcmp(cmp, device, 
cmp_len))
+   break;
+
+   close(sysfs);
You might want to set sysfs = -1 here just in the unlikely case this is 
never found.

+   }
+   }
+
+   return sysfs;
  }
  
  struct intel_perf *



With the proposed changes :

Reviewed-by: Lionel Landwerlin 

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [RFC PATCH i-g-t] lib/i915/perf: Fix non-card0 processing

2021-05-03 Thread Lionel Landwerlin


On 30/04/2021 19:18, Janusz Krzysztofik wrote:

IGT i915/perf library functions now always operate on sysfs perf
attributes of card0 device node, no matter which DRM device fd a user
passes.  The intention was to always switch to primary device node if
a user passes a render device node fd, but that breaks handling of
non-card0 devices.

Instead of forcibly using DRM device minor number 0 when opening a
device sysfs area, convert device minor number of a user passed device
fd to the minor number of respective primary (cardX) device node.

Signed-off-by: Janusz Krzysztofik 
---
  lib/i915/perf.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/i915/perf.c b/lib/i915/perf.c
index 56d5c0b3a..336824df7 100644
--- a/lib/i915/perf.c
+++ b/lib/i915/perf.c
@@ -376,8 +376,8 @@ open_master_sysfs_dir(int drm_fd)
if (fstat(drm_fd, ) || !S_ISCHR(st.st_mode))
  return -1;
  
-snprintf(path, sizeof(path), "/sys/dev/char/%d:0",

- major(st.st_rdev));
+snprintf(path, sizeof(path), "/sys/dev/char/%d:%d",
+ major(st.st_rdev), minor(st.st_rdev) & ~128);



Isn't it minor(st.st_rdev) & 0xff ? or even 0x3f ?

Looks like /dev/dri/controlD64 can exist too.


-Lionel


  
  	return open(path, O_DIRECTORY);

  }



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-29 Thread Lionel Landwerlin


On 29/04/2021 03:34, Umesh Nerlige Ramappa wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
   register and return it to user.

v11: (Jani)
- IS_GEN deprecated. User GRAPHICS_VER instead.

v12: (Jason)
- Split cpu timestamp array into timestamp and delta for cleaner API

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 



Thanks for the update :


Reviewed-by: Lionel Landwerlin 



---
  drivers/gpu/drm/i915/i915_query.c | 148 ++
  include/uapi/drm/i915_drm.h   |  52 +++
  2 files changed, 200 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..357c44e8177c 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@
  
  #include 
  
+#include "gt/intel_engine_pm.h"

+#include "gt/intel_engine_user.h"
  #include "i915_drv.h"
  #include "i915_perf.h"
  #include "i915_query.h"
@@ -90,6 +92,151 @@ static int query_topology_info(struct drm_i915_private 
*dev_priv,
return total_length;
  }
  
+typedef u64 (*__ktime_func_t)(void);

+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+   /*
+* Use logic same as the perf subsystem to allow user to select the
+* reference clock id to be used for timestamps.
+*/
+   switch (clk_id) {
+   case CLOCK_MONOTONIC:
+   return _get_ns;
+   case CLOCK_MONOTONIC_RAW:
+   return _get_raw_ns;
+   case CLOCK_REALTIME:
+   return _get_real_ns;
+   case CLOCK_BOOTTIME:
+   return _get_boottime_ns;
+   case CLOCK_TAI:
+   return _get_clocktai_ns;
+   default:
+   return NULL;
+   }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+ i915_reg_t lower_reg,
+ i915_reg_t upper_reg,
+ u64 *cs_ts,
+ u64 *cpu_ts,
+ u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   u32 upper, lower, old_upper, loop = 0;
+
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   do {
+   *cpu_delta = local_clock();
+   *cpu_ts = cpu_clock();
+   lower = intel_uncore_read_fw(uncore, lower_reg);
+   *cpu_delta = local_clock() - *cpu_delta;
+   old_upper = upper;
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   } while (upper != old_upper && loop++ < 2);
+
+   *cs_ts = (u64)upper << 32 | lower;
+
+   return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+ u64 *cs_ts, u64 *cpu_ts, u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   struct intel_uncore *uncore = engine->uncore;
+   enum forcewake_domains fw_domains;
+   u32 base = engine->mmio_base;
+   intel_wakeref_t wakeref;
+   int ret;
+
+   fw_domains = intel_uncore_forcewake_for_reg(uncore,
+   RING_TIMESTAMP(base),
+   FW_REG_READ);
+
+   with_intel_runtime_pm(uncore->rpm, wakeref) {
+   spin_lock_irq(>lock);
+   intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+   ret = __read_timestamps(uncore,
+   RING_TIMESTAMP(base),
+   RING_TIMESTAMP_UDW(base),
+

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-28 Thread Lionel Landwerlin


On 28/04/2021 23:45, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 3:14 PM Lionel Landwerlin
 wrote:

On 28/04/2021 22:54, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 2:50 PM Lionel Landwerlin
 wrote:

On 28/04/2021 22:24, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 3:43 AM Jani Nikula  wrote:

On Tue, 27 Apr 2021, Umesh Nerlige Ramappa  
wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

Cc: dri-devel, Jason and Daniel for review.

Thanks!

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
in perf_event_open. This clock id is used by the perf subsystem to
return the appropriate cpu timestamp in perf events. Similarly, let
the user pass the clockid to this query so that cpu timestamp
corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
register and return it to user.

v11: (Jani)
- IS_GEN deprecated. User GRAPHICS_VER instead.

Signed-off-by: Umesh Nerlige Ramappa 
---
   drivers/gpu/drm/i915/i915_query.c | 145 ++
   include/uapi/drm/i915_drm.h   |  48 ++
   2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..2594b93901ac 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@

   #include 

+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
   #include "i915_drv.h"
   #include "i915_perf.h"
   #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct drm_i915_private 
*dev_priv,
return total_length;
   }

+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+ /*
+  * Use logic same as the perf subsystem to allow user to select the
+  * reference clock id to be used for timestamps.
+  */
+ switch (clk_id) {
+ case CLOCK_MONOTONIC:
+ return _get_ns;
+ case CLOCK_MONOTONIC_RAW:
+ return _get_raw_ns;
+ case CLOCK_REALTIME:
+ return _get_real_ns;
+ case CLOCK_BOOTTIME:
+ return _get_boottime_ns;
+ case CLOCK_TAI:
+ return _get_clocktai_ns;
+ default:
+ return NULL;
+ }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+   i915_reg_t lower_reg,
+   i915_reg_t upper_reg,
+   u64 *cs_ts,
+   u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ u32 upper, lower, old_upper, loop = 0;
+
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ do {
+ cpu_ts[1] = local_clock();
+ cpu_ts[0] = cpu_clock();
+ lower = intel_uncore_read_fw(uncore, lower_reg);
+ cpu_ts[1] = local_clock() - cpu_ts[1];
+ old_upper = upper;
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ } while (upper != old_upper && loop++ < 2);
+
+ *cs_ts = (u64)upper << 32 | lower;
+
+ return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+   u64 *cs_ts, u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ struct intel_uncore *uncore = engine->uncore;
+ enum forcewake_domains fw_domains;
+ u32 base = engine->mmio_base;
+ intel_wakeref_t wakeref;
+ int ret;
+
+ fw_domains = intel_uncore_forcewake_for_reg(uncore,
+ RING_TIMESTAMP(base),
+ FW_REG_READ);
+
+ with_intel_runtime_pm(uncore->rpm, wakeref) {
+ spin_lock_irq(>lock);
+ intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+ ret = __read_timestamps(uncore,
+ RING_TIMESTAMP(base),
+

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-28 Thread Lionel Landwerlin


On 28/04/2021 23:14, Lionel Landwerlin wrote:

On 28/04/2021 22:54, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 2:50 PM Lionel Landwerlin
 wrote:

On 28/04/2021 22:24, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 3:43 AM Jani Nikula 
 wrote:


On Tue, 27 Apr 2021, Umesh Nerlige Ramappa 
 wrote:


Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

Cc: dri-devel, Jason and Daniel for review.

Thanks!

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
   register and return it to user.

v11: (Jani)
- IS_GEN deprecated. User GRAPHICS_VER instead.

Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_query.c | 145 
++

  include/uapi/drm/i915_drm.h   |  48 ++
  2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c

index fed337ad7b68..2594b93901ac 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@

  #include 

+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
  #include "i915_drv.h"
  #include "i915_perf.h"
  #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct 
drm_i915_private *dev_priv,

   return total_length;
  }

+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+ /*
+  * Use logic same as the perf subsystem to allow user to 
select the

+  * reference clock id to be used for timestamps.
+  */
+ switch (clk_id) {
+ case CLOCK_MONOTONIC:
+ return _get_ns;
+ case CLOCK_MONOTONIC_RAW:
+ return _get_raw_ns;
+ case CLOCK_REALTIME:
+ return _get_real_ns;
+ case CLOCK_BOOTTIME:
+ return _get_boottime_ns;
+ case CLOCK_TAI:
+ return _get_clocktai_ns;
+ default:
+ return NULL;
+ }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+   i915_reg_t lower_reg,
+   i915_reg_t upper_reg,
+   u64 *cs_ts,
+   u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ u32 upper, lower, old_upper, loop = 0;
+
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ do {
+ cpu_ts[1] = local_clock();
+ cpu_ts[0] = cpu_clock();
+ lower = intel_uncore_read_fw(uncore, lower_reg);
+ cpu_ts[1] = local_clock() - cpu_ts[1];
+ old_upper = upper;
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ } while (upper != old_upper && loop++ < 2);
+
+ *cs_ts = (u64)upper << 32 | lower;
+
+ return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+   u64 *cs_ts, u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ struct intel_uncore *uncore = engine->uncore;
+ enum forcewake_domains fw_domains;
+ u32 base = engine->mmio_base;
+ intel_wakeref_t wakeref;
+ int ret;
+
+ fw_domains = intel_uncore_forcewake_for_reg(uncore,
+ RING_TIMESTAMP(base),
+ FW_REG_READ);
+
+ with_intel_runtime_pm(uncore->rpm, wakeref) {
+ spin_lock_irq(>lock);
+ intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+ ret = __read_timestamps(uncore,
+ RING_TIMESTAMP(base),
+ RING_TIMESTAMP_UDW(base),
+ cs_ts,
+ cpu_ts,
+ cpu_clock);
+
+ intel_uncore_forcewake_put__locked(uncore

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-28 Thread Lionel Landwerlin


On 28/04/2021 22:54, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 2:50 PM Lionel Landwerlin
 wrote:

On 28/04/2021 22:24, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 3:43 AM Jani Nikula  wrote:

On Tue, 27 Apr 2021, Umesh Nerlige Ramappa  
wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

Cc: dri-devel, Jason and Daniel for review.

Thanks!

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
   register and return it to user.

v11: (Jani)
- IS_GEN deprecated. User GRAPHICS_VER instead.

Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_query.c | 145 ++
  include/uapi/drm/i915_drm.h   |  48 ++
  2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..2594b93901ac 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@

  #include 

+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
  #include "i915_drv.h"
  #include "i915_perf.h"
  #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct drm_i915_private 
*dev_priv,
   return total_length;
  }

+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+ /*
+  * Use logic same as the perf subsystem to allow user to select the
+  * reference clock id to be used for timestamps.
+  */
+ switch (clk_id) {
+ case CLOCK_MONOTONIC:
+ return _get_ns;
+ case CLOCK_MONOTONIC_RAW:
+ return _get_raw_ns;
+ case CLOCK_REALTIME:
+ return _get_real_ns;
+ case CLOCK_BOOTTIME:
+ return _get_boottime_ns;
+ case CLOCK_TAI:
+ return _get_clocktai_ns;
+ default:
+ return NULL;
+ }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+   i915_reg_t lower_reg,
+   i915_reg_t upper_reg,
+   u64 *cs_ts,
+   u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ u32 upper, lower, old_upper, loop = 0;
+
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ do {
+ cpu_ts[1] = local_clock();
+ cpu_ts[0] = cpu_clock();
+ lower = intel_uncore_read_fw(uncore, lower_reg);
+ cpu_ts[1] = local_clock() - cpu_ts[1];
+ old_upper = upper;
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ } while (upper != old_upper && loop++ < 2);
+
+ *cs_ts = (u64)upper << 32 | lower;
+
+ return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+   u64 *cs_ts, u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ struct intel_uncore *uncore = engine->uncore;
+ enum forcewake_domains fw_domains;
+ u32 base = engine->mmio_base;
+ intel_wakeref_t wakeref;
+ int ret;
+
+ fw_domains = intel_uncore_forcewake_for_reg(uncore,
+ RING_TIMESTAMP(base),
+ FW_REG_READ);
+
+ with_intel_runtime_pm(uncore->rpm, wakeref) {
+ spin_lock_irq(>lock);
+ intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+ ret = __read_timestamps(uncore,
+ RING_TIMESTAMP(base),
+ RING_TIMESTAMP_UDW(base),
+ cs_ts,
+ cpu_ts,
+

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-28 Thread Lionel Landwerlin


On 28/04/2021 22:24, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 3:43 AM Jani Nikula  wrote:

On Tue, 27 Apr 2021, Umesh Nerlige Ramappa  
wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

Cc: dri-devel, Jason and Daniel for review.

Thanks!


v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
   register and return it to user.

v11: (Jani)
- IS_GEN deprecated. User GRAPHICS_VER instead.

Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_query.c | 145 ++
  include/uapi/drm/i915_drm.h   |  48 ++
  2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..2594b93901ac 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@

  #include 

+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
  #include "i915_drv.h"
  #include "i915_perf.h"
  #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct drm_i915_private 
*dev_priv,
   return total_length;
  }

+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+ /*
+  * Use logic same as the perf subsystem to allow user to select the
+  * reference clock id to be used for timestamps.
+  */
+ switch (clk_id) {
+ case CLOCK_MONOTONIC:
+ return _get_ns;
+ case CLOCK_MONOTONIC_RAW:
+ return _get_raw_ns;
+ case CLOCK_REALTIME:
+ return _get_real_ns;
+ case CLOCK_BOOTTIME:
+ return _get_boottime_ns;
+ case CLOCK_TAI:
+ return _get_clocktai_ns;
+ default:
+ return NULL;
+ }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+   i915_reg_t lower_reg,
+   i915_reg_t upper_reg,
+   u64 *cs_ts,
+   u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ u32 upper, lower, old_upper, loop = 0;
+
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ do {
+ cpu_ts[1] = local_clock();
+ cpu_ts[0] = cpu_clock();
+ lower = intel_uncore_read_fw(uncore, lower_reg);
+ cpu_ts[1] = local_clock() - cpu_ts[1];
+ old_upper = upper;
+ upper = intel_uncore_read_fw(uncore, upper_reg);
+ } while (upper != old_upper && loop++ < 2);
+
+ *cs_ts = (u64)upper << 32 | lower;
+
+ return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+   u64 *cs_ts, u64 *cpu_ts,
+   __ktime_func_t cpu_clock)
+{
+ struct intel_uncore *uncore = engine->uncore;
+ enum forcewake_domains fw_domains;
+ u32 base = engine->mmio_base;
+ intel_wakeref_t wakeref;
+ int ret;
+
+ fw_domains = intel_uncore_forcewake_for_reg(uncore,
+ RING_TIMESTAMP(base),
+ FW_REG_READ);
+
+ with_intel_runtime_pm(uncore->rpm, wakeref) {
+ spin_lock_irq(>lock);
+ intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+ ret = __read_timestamps(uncore,
+ RING_TIMESTAMP(base),
+ RING_TIMESTAMP_UDW(base),
+ cs_ts,
+ cpu_ts,
+ cpu_clock);
+
+ intel_uncore_forcewake_put__locked(uncore, fw_domains);
+ spin_unlock_irq(>lock);
+ }
+
+ return ret;
+}
+
+static int

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-23 Thread Lionel Landwerlin


On 23/04/2021 18:11, Umesh Nerlige Ramappa wrote:

On Fri, Apr 23, 2021 at 10:05:34AM +0300, Lionel Landwerlin wrote:

On 21/04/2021 20:28, Umesh Nerlige Ramappa wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
  in perf_event_open. This clock id is used by the perf subsystem to
  return the appropriate cpu timestamp in perf events. Similarly, let
  the user pass the clockid to this query so that cpu timestamp
  corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
  register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
  register and return it to user.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/i915_query.c | 145 ++
 include/uapi/drm/i915_drm.h   |  48 ++
 2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c

index fed337ad7b68..25b96927ab92 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@
 #include 
+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
 #include "i915_drv.h"
 #include "i915_perf.h"
 #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct 
drm_i915_private *dev_priv,

 return total_length;
 }
+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+    /*
+ * Use logic same as the perf subsystem to allow user to select 
the

+ * reference clock id to be used for timestamps.
+ */
+    switch (clk_id) {
+    case CLOCK_MONOTONIC:
+    return _get_ns;
+    case CLOCK_MONOTONIC_RAW:
+    return _get_raw_ns;
+    case CLOCK_REALTIME:
+    return _get_real_ns;
+    case CLOCK_BOOTTIME:
+    return _get_boottime_ns;
+    case CLOCK_TAI:
+    return _get_clocktai_ns;
+    default:
+    return NULL;
+    }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+  i915_reg_t lower_reg,
+  i915_reg_t upper_reg,
+  u64 *cs_ts,
+  u64 *cpu_ts,
+  __ktime_func_t cpu_clock)
+{
+    u32 upper, lower, old_upper, loop = 0;
+
+    upper = intel_uncore_read_fw(uncore, upper_reg);
+    do {
+    cpu_ts[1] = local_clock();
+    cpu_ts[0] = cpu_clock();
+    lower = intel_uncore_read_fw(uncore, lower_reg);
+    cpu_ts[1] = local_clock() - cpu_ts[1];
+    old_upper = upper;
+    upper = intel_uncore_read_fw(uncore, upper_reg);
+    } while (upper != old_upper && loop++ < 2);
+
+    *cs_ts = (u64)upper << 32 | lower;
+
+    return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+  u64 *cs_ts, u64 *cpu_ts,
+  __ktime_func_t cpu_clock)
+{
+    struct intel_uncore *uncore = engine->uncore;
+    enum forcewake_domains fw_domains;
+    u32 base = engine->mmio_base;
+    intel_wakeref_t wakeref;
+    int ret;
+
+    fw_domains = intel_uncore_forcewake_for_reg(uncore,
+    RING_TIMESTAMP(base),
+    FW_REG_READ);
+
+    with_intel_runtime_pm(uncore->rpm, wakeref) {
+    spin_lock_irq(>lock);
+    intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+    ret = __read_timestamps(uncore,
+    RING_TIMESTAMP(base),
+    RING_TIMESTAMP_UDW(base),
+    cs_ts,
+    cpu_ts,
+    cpu_clock);
+
+    intel_uncore_forcewake_put__locked(uncore, fw_domains);
+    spin_unlock_irq(>lock);
+    }
+
+    return ret;
+}
+
+static int
+query_cs_cycles(struct drm_i915_private *i915,
+    struct drm_i915_query_item *query_item)
+{
+    struct drm_i915_query_cs_cycles __user *query_ptr;
+    struct drm_i915_query_cs_cycles query;
+    struct intel_engine_cs *engine;
+    __ktime_func_t cpu_clock;
+    int ret;
+
+    if (IN

Re: [Intel-gfx] [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-23 Thread Lionel Landwerlin


On 21/04/2021 20:28, Umesh Nerlige Ramappa wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
   register and return it to user.

Signed-off-by: Umesh Nerlige Ramappa 
---
  drivers/gpu/drm/i915/i915_query.c | 145 ++
  include/uapi/drm/i915_drm.h   |  48 ++
  2 files changed, 193 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..25b96927ab92 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@
  
  #include 
  
+#include "gt/intel_engine_pm.h"

+#include "gt/intel_engine_user.h"
  #include "i915_drv.h"
  #include "i915_perf.h"
  #include "i915_query.h"
@@ -90,6 +92,148 @@ static int query_topology_info(struct drm_i915_private 
*dev_priv,
return total_length;
  }
  
+typedef u64 (*__ktime_func_t)(void);

+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+   /*
+* Use logic same as the perf subsystem to allow user to select the
+* reference clock id to be used for timestamps.
+*/
+   switch (clk_id) {
+   case CLOCK_MONOTONIC:
+   return _get_ns;
+   case CLOCK_MONOTONIC_RAW:
+   return _get_raw_ns;
+   case CLOCK_REALTIME:
+   return _get_real_ns;
+   case CLOCK_BOOTTIME:
+   return _get_boottime_ns;
+   case CLOCK_TAI:
+   return _get_clocktai_ns;
+   default:
+   return NULL;
+   }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+ i915_reg_t lower_reg,
+ i915_reg_t upper_reg,
+ u64 *cs_ts,
+ u64 *cpu_ts,
+ __ktime_func_t cpu_clock)
+{
+   u32 upper, lower, old_upper, loop = 0;
+
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   do {
+   cpu_ts[1] = local_clock();
+   cpu_ts[0] = cpu_clock();
+   lower = intel_uncore_read_fw(uncore, lower_reg);
+   cpu_ts[1] = local_clock() - cpu_ts[1];
+   old_upper = upper;
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   } while (upper != old_upper && loop++ < 2);
+
+   *cs_ts = (u64)upper << 32 | lower;
+
+   return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+ u64 *cs_ts, u64 *cpu_ts,
+ __ktime_func_t cpu_clock)
+{
+   struct intel_uncore *uncore = engine->uncore;
+   enum forcewake_domains fw_domains;
+   u32 base = engine->mmio_base;
+   intel_wakeref_t wakeref;
+   int ret;
+
+   fw_domains = intel_uncore_forcewake_for_reg(uncore,
+   RING_TIMESTAMP(base),
+   FW_REG_READ);
+
+   with_intel_runtime_pm(uncore->rpm, wakeref) {
+   spin_lock_irq(>lock);
+   intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+   ret = __read_timestamps(uncore,
+   RING_TIMESTAMP(base),
+   RING_TIMESTAMP_UDW(base),
+   cs_ts,
+   cpu_ts,
+   cpu_clock);
+
+   intel_uncore_forcewake_put__locked(uncore, fw_domains);
+   spin_unlock_irq(>lock);
+   }
+
+   return ret;
+}
+
+static int
+query_cs_cycles(struct drm_i915_private *i915,
+   struct drm_i915_query_item

Re: [Intel-gfx] [PATCH v3 00/16] Introduce Intel PXP

2021-04-01 Thread Lionel Landwerlin


On 29/03/2021 01:56, Daniele Ceraolo Spurio wrote:

PXP (Protected Xe Path) is an i915 component, available on
GEN12+, that helps to establish the hardware protected session
and manage the status of the alive software session, as well
as its life cycle.

Lots of minor changes and fixes, but the main changes in v3 are:

- Using a protected object with a context not appropriately marked does
   no longer result in an execbuf failure. This is to avoid apps
   maliciously sharing protected/invalid objects to other apps and
   causing them to fail.
- All the termination work now goes through the same worker function,
   which allows i915 to drop the mutex lock entirely.

Cc: Gaurav Kumar 
Cc: Chris Wilson 
Cc: Rodrigo Vivi 
Cc: Joonas Lahtinen 
Cc: Juston Li 
Cc: Alan Previn 
Cc: Lionel Landwerlin 



I updated the Mesa MR to use this new version :

    - Iris: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8092

    - Anv: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8064


No issue with this current iteration :


Tested-by: Lionel Landwerlin 




Anshuman Gupta (2):
   drm/i915/pxp: Add plane decryption support
   drm/i915/pxp: black pixels on pxp disabled

Bommu Krishnaiah (2):
   drm/i915/uapi: introduce drm_i915_gem_create_ext
   drm/i915/pxp: User interface for Protected buffer

Daniele Ceraolo Spurio (6):
   drm/i915/pxp: Define PXP component interface
   drm/i915/pxp: define PXP device flag and kconfig
   drm/i915/pxp: allocate a vcs context for pxp usage
   drm/i915/pxp: set KCR reg init
   drm/i915/pxp: interface for marking contexts as using protected
 content
   drm/i915/pxp: enable PXP for integrated Gen12

Huang, Sean Z (5):
   drm/i915/pxp: Implement funcs to create the TEE channel
   drm/i915/pxp: Create the arbitrary session after boot
   drm/i915/pxp: Implement arb session teardown
   drm/i915/pxp: Implement PXP irq handler
   drm/i915/pxp: Enable PXP power management

Vitaly Lubart (1):
   mei: pxp: export pavp client to me client bus

  drivers/gpu/drm/i915/Kconfig  |  11 +
  drivers/gpu/drm/i915/Makefile |   9 +
  .../drm/i915/display/skl_universal_plane.c|  50 +++-
  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  59 +++-
  drivers/gpu/drm/i915/gem/i915_gem_context.h   |  18 ++
  .../gpu/drm/i915/gem/i915_gem_context_types.h |   2 +
  drivers/gpu/drm/i915/gem/i915_gem_create.c|  68 -
  .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  34 +++
  drivers/gpu/drm/i915/gem/i915_gem_object.c|   6 +
  drivers/gpu/drm/i915/gem/i915_gem_object.h|  12 +
  .../gpu/drm/i915/gem/i915_gem_object_types.h  |  13 +
  drivers/gpu/drm/i915/gt/intel_engine.h|  12 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  32 ++-
  drivers/gpu/drm/i915/gt/intel_gpu_commands.h  |  22 +-
  drivers/gpu/drm/i915/gt/intel_gt.c|   5 +
  drivers/gpu/drm/i915/gt/intel_gt_irq.c|   7 +
  drivers/gpu/drm/i915/gt/intel_gt_pm.c |  14 +-
  drivers/gpu/drm/i915/gt/intel_gt_types.h  |   3 +
  drivers/gpu/drm/i915/i915_drv.c   |   4 +-
  drivers/gpu/drm/i915/i915_drv.h   |   4 +
  drivers/gpu/drm/i915/i915_pci.c   |   2 +
  drivers/gpu/drm/i915/i915_reg.h   |  48 
  drivers/gpu/drm/i915/intel_device_info.h  |   1 +
  drivers/gpu/drm/i915/pxp/intel_pxp.c  | 262 ++
  drivers/gpu/drm/i915/pxp/intel_pxp.h  |  65 +
  drivers/gpu/drm/i915/pxp/intel_pxp_cmd.c  | 140 ++
  drivers/gpu/drm/i915/pxp/intel_pxp_cmd.h  |  15 +
  drivers/gpu/drm/i915/pxp/intel_pxp_irq.c  | 100 +++
  drivers/gpu/drm/i915/pxp/intel_pxp_irq.h  |  32 +++
  drivers/gpu/drm/i915/pxp/intel_pxp_pm.c   |  37 +++
  drivers/gpu/drm/i915/pxp/intel_pxp_pm.h   |  23 ++
  drivers/gpu/drm/i915/pxp/intel_pxp_session.c  | 172 
  drivers/gpu/drm/i915/pxp/intel_pxp_session.h  |  15 +
  drivers/gpu/drm/i915/pxp/intel_pxp_tee.c  | 182 
  drivers/gpu/drm/i915/pxp/intel_pxp_tee.h  |  17 ++
  drivers/gpu/drm/i915/pxp/intel_pxp_types.h|  43 +++
  drivers/misc/mei/Kconfig  |   2 +
  drivers/misc/mei/Makefile |   1 +
  drivers/misc/mei/pxp/Kconfig  |  13 +
  drivers/misc/mei/pxp/Makefile |   7 +
  drivers/misc/mei/pxp/mei_pxp.c| 233 
  drivers/misc/mei/pxp/mei_pxp.h|  18 ++
  include/drm/i915_component.h  |   1 +
  include/drm/i915_pxp_tee_interface.h  |  45 +++
  include/uapi/drm/i915_drm.h   |  96 +++
  45 files changed, 1931 insertions(+), 24 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/pxp/intel_pxp.c
  create mode 100644 drivers/gpu/drm/i915/pxp/intel_pxp.h
  create mode 100644 drivers/gpu/drm/i915/pxp/intel_pxp_cmd.c
  create mode 100644 drivers/gpu/drm/i915/pxp/intel_pxp_cmd.h
  create mode 100644 drivers/gpu/drm/i915/pxp

Re: [Intel-gfx] [PATCH v3 11/16] drm/i915/pxp: interface for marking contexts as using protected content

2021-04-01 Thread Lionel Landwerlin


On 29/03/2021 01:57, Daniele Ceraolo Spurio wrote:

Extra tracking and checks around protected objects, coming in a follow-up
patch, will be enabled only for contexts that opt in. Contexts can only be
marked as using protected content at creation time and they must be both
bannable and not recoverable.

When a PXP teardown occurs, all gem contexts marked this way that
have been used at least once will be marked as invalid and all new
submissions using them will be rejected. All intel contexts within the
invalidated gem contexts will be marked banned.
A new flag has been added to the RESET_STATS ioctl to report the
invalidation to userspace.

v2: split to its own patch and improve doc (Chris), invalidate contexts
on teardown

v3: improve doc, use -EACCES for execbuf fail (Chris), make protected
 context flag not mandatory in protected object execbuf to avoid
 abuse (Lionel)

Signed-off-by: Daniele Ceraolo Spurio 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 59 ++-
  drivers/gpu/drm/i915/gem/i915_gem_context.h   | 18 ++
  .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
  .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 18 ++
  drivers/gpu/drm/i915/pxp/intel_pxp.c  | 48 +++
  drivers/gpu/drm/i915/pxp/intel_pxp.h  |  1 +
  drivers/gpu/drm/i915/pxp/intel_pxp_session.c  |  3 +
  include/uapi/drm/i915_drm.h   | 26 
  8 files changed, 172 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index fd8ee52e17a4..f3fd302682bb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -76,6 +76,8 @@
  #include "gt/intel_gpu_commands.h"
  #include "gt/intel_ring.h"
  
+#include "pxp/intel_pxp.h"

+
  #include "i915_gem_context.h"
  #include "i915_globals.h"
  #include "i915_trace.h"
@@ -1972,6 +1974,40 @@ static int set_priority(struct i915_gem_context *ctx,
return 0;
  }
  
+static int set_protected(struct i915_gem_context *ctx,

+const struct drm_i915_gem_context_param *args)
+{
+   int ret = 0;
+
+   if (!intel_pxp_is_enabled(>i915->gt.pxp))
+   ret = -ENODEV;
+   else if (ctx->file_priv) /* can't change this after creation! */
+   ret = -EEXIST;
+   else if (args->size)
+   ret = -EINVAL;
+   else if (!args->value)
+   clear_bit(UCONTEXT_PROTECTED, >user_flags);
+   else if (i915_gem_context_is_recoverable(ctx) ||
+!i915_gem_context_is_bannable(ctx))
+   ret = -EPERM;
+   else
+   set_bit(UCONTEXT_PROTECTED, >user_flags);
+
+   return ret;
+}
+
+static int get_protected(struct i915_gem_context *ctx,
+struct drm_i915_gem_context_param *args)
+{
+   if (!intel_pxp_is_enabled(>i915->gt.pxp))
+   return -ENODEV;
+
+   args->size = 0;
+   args->value = i915_gem_context_uses_protected_content(ctx);
+
+   return 0;
+}
+
  static int ctx_setparam(struct drm_i915_file_private *fpriv,
struct i915_gem_context *ctx,
struct drm_i915_gem_context_param *args)
@@ -2004,6 +2040,8 @@ static int ctx_setparam(struct drm_i915_file_private 
*fpriv,
ret = -EPERM;
else if (args->value)
i915_gem_context_set_bannable(ctx);
+   else if (i915_gem_context_uses_protected_content(ctx))
+   ret = -EPERM; /* can't clear this for protected 
contexts */
else
i915_gem_context_clear_bannable(ctx);
break;
@@ -2011,10 +2049,12 @@ static int ctx_setparam(struct drm_i915_file_private 
*fpriv,
case I915_CONTEXT_PARAM_RECOVERABLE:
if (args->size)
ret = -EINVAL;
-   else if (args->value)
-   i915_gem_context_set_recoverable(ctx);
-   else
+   else if (!args->value)
i915_gem_context_clear_recoverable(ctx);
+   else if (i915_gem_context_uses_protected_content(ctx))
+   ret = -EPERM; /* can't set this for protected contexts 
*/
+   else
+   i915_gem_context_set_recoverable(ctx);
break;
  
  	case I915_CONTEXT_PARAM_PRIORITY:

@@ -2041,6 +2081,10 @@ static int ctx_setparam(struct drm_i915_file_private 
*fpriv,
ret = set_ringsize(ctx, args);
break;
  
+	case I915_CONTEXT_PARAM_PROTECTED_CONTENT:

+   ret = set_protected(ctx, args);
+   break;
+
case I915_CONTEXT_PARAM_BAN_PERIOD:
default:
ret

Re: [Intel-gfx] [PATCH v3 13/16] drm/i915/pxp: User interface for Protected buffer

2021-04-01 Thread Lionel Landwerlin


On 29/03/2021 01:57, Daniele Ceraolo Spurio wrote:

From: Bommu Krishnaiah 

This api allow user mode to create Protected buffers. Only contexts
marked as protected are allowed to operate on protected buffers.

We only allow setting the flags at creation time.

All protected objects that have backing storage will be considered
invalid when the session is destroyed and they won't be usable anymore.

This is a rework of the original code by Bommu Krishnaiah. I've
authorship unchanged since significant chunks have not been modified.

v2: split context changes, fix defines and improve documentation (Chris),
 add object invalidation logic
v3: fix spinlock definition and usage, only validate objects when
 they're first added to a context lut, only remove them once (Chris),
 make protected context flag not mandatory in protected object execbuf
 to avoid abuse (Lionel)

Signed-off-by: Bommu Krishnaiah 
Signed-off-by: Daniele Ceraolo Spurio 
Cc: Telukuntla Sreedhar 
Cc: Kondapally Kalyan 
Cc: Gupta Anshuman 
Cc: Huang Sean Z 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gem/i915_gem_create.c| 27 ++--
  .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 16 
  drivers/gpu/drm/i915/gem/i915_gem_object.c|  6 +++
  drivers/gpu/drm/i915/gem/i915_gem_object.h| 12 ++
  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 13 ++
  drivers/gpu/drm/i915/pxp/intel_pxp.c  | 41 +++
  drivers/gpu/drm/i915/pxp/intel_pxp.h  | 13 ++
  drivers/gpu/drm/i915/pxp/intel_pxp_types.h|  5 +++
  include/uapi/drm/i915_drm.h   | 20 +
  9 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_create.c 
b/drivers/gpu/drm/i915/gem/i915_gem_create.c
index 3ad3413c459f..d02e5938afbe 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_create.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_create.c
@@ -5,6 +5,7 @@
  
  #include "gem/i915_gem_ioctls.h"

  #include "gem/i915_gem_region.h"
+#include "pxp/intel_pxp.h"
  
  #include "i915_drv.h"

  #include "i915_user_extensions.h"
@@ -13,7 +14,8 @@ static int
  i915_gem_create(struct drm_file *file,
struct intel_memory_region *mr,
u64 *size_p,
-   u32 *handle_p)
+   u32 *handle_p,
+   u64 user_flags)
  {
struct drm_i915_gem_object *obj;
u32 handle;
@@ -35,12 +37,17 @@ i915_gem_create(struct drm_file *file,
  
  	GEM_BUG_ON(size != obj->base.size);
  
+	obj->user_flags = user_flags;

+
ret = drm_gem_handle_create(file, >base, );
/* drop reference from allocate - handle holds it now */
i915_gem_object_put(obj);
if (ret)
return ret;
  
+	if (user_flags & I915_GEM_OBJECT_PROTECTED)

+   intel_pxp_object_add(obj);
+
*handle_p = handle;
*size_p = size;
return 0;
@@ -89,11 +96,12 @@ i915_gem_dumb_create(struct drm_file *file,
return i915_gem_create(file,
   intel_memory_region_by_type(to_i915(dev),
   mem_type),
-  >size, >handle);
+  >size, >handle, 0);
  }
  
  struct create_ext {

struct drm_i915_private *i915;
+   unsigned long user_flags;
  };
  
  static int __create_setparam(struct drm_i915_gem_object_param *args,

@@ -104,6 +112,19 @@ static int __create_setparam(struct 
drm_i915_gem_object_param *args,
return -EINVAL;
}
  
+	switch (lower_32_bits(args->param)) {

+   case I915_OBJECT_PARAM_PROTECTED_CONTENT:
+   if (!intel_pxp_is_enabled(_data->i915->gt.pxp))
+   return -ENODEV;
+   if (args->size) {
+   return -EINVAL;
+   } else if (args->data) {
+   ext_data->user_flags |= I915_GEM_OBJECT_PROTECTED;
+   return 0;
+   }
+   break;
+   }
+
return -EINVAL;
  }
  
@@ -148,5 +169,5 @@ i915_gem_create_ioctl(struct drm_device *dev, void *data,

return i915_gem_create(file,
   intel_memory_region_by_type(i915,
   INTEL_MEMORY_SYSTEM),
-  >size, >handle);
+  >size, >handle, ext_data.user_flags);
  }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 72c2470fcfe6..2fb6579ad301 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -20,6 +20,7 @@
  #include "gt/intel_gt_buffer_pool.h"
  #include "gt/intel_gt_pm.h"
  #include "gt/intel_ring.h"
+#include &qu

Re: [Intel-gfx] [PATCH v2 13/16] drm/i915/pxp: User interface for Protected buffer

2021-03-08 Thread Lionel Landwerlin


On 08/03/2021 22:40, Rodrigo Vivi wrote:

On Wed, Mar 03, 2021 at 05:24:34PM -0800, Daniele Ceraolo Spurio wrote:


On 3/3/2021 4:10 PM, Daniele Ceraolo Spurio wrote:


On 3/3/2021 3:42 PM, Lionel Landwerlin wrote:

On 04/03/2021 01:25, Daniele Ceraolo Spurio wrote:


On 3/3/2021 3:16 PM, Lionel Landwerlin wrote:

On 03/03/2021 23:59, Daniele Ceraolo Spurio wrote:


On 3/3/2021 12:39 PM, Lionel Landwerlin wrote:

On 01/03/2021 21:31, Daniele Ceraolo Spurio wrote:

From: Bommu Krishnaiah 

This api allow user mode to create Protected buffers. Only contexts
marked as protected are allowed to operate on protected buffers.

We only allow setting the flags at creation time.

All protected objects that have backing storage will be considered
invalid when the session is destroyed and they
won't be usable anymore.

This is a rework of the original code by Bommu Krishnaiah. I've
authorship unchanged since significant chunks
have not been modified.

v2: split context changes, fix defines and
improve documentation (Chris),
  add object invalidation logic

Signed-off-by: Bommu Krishnaiah 
Signed-off-by: Daniele Ceraolo Spurio

Cc: Telukuntla Sreedhar 
Cc: Kondapally Kalyan 
Cc: Gupta Anshuman 
Cc: Huang Sean Z 
Cc: Chris Wilson 
---
   drivers/gpu/drm/i915/gem/i915_gem_create.c    | 27 +++--
   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 10 +
   drivers/gpu/drm/i915/gem/i915_gem_object.c    |  6 +++
   drivers/gpu/drm/i915/gem/i915_gem_object.h    | 12 ++
   .../gpu/drm/i915/gem/i915_gem_object_types.h  | 13 ++
   drivers/gpu/drm/i915/pxp/intel_pxp.c
| 40 +++
   drivers/gpu/drm/i915/pxp/intel_pxp.h  | 13 ++
   drivers/gpu/drm/i915/pxp/intel_pxp_types.h    |  5 +++
   include/uapi/drm/i915_drm.h   | 22 ++
   9 files changed, 145 insertions(+), 3 deletions(-)

diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_create.c
b/drivers/gpu/drm/i915/gem/i915_gem_create.c
index 3ad3413c459f..d02e5938afbe 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_create.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_create.c
@@ -5,6 +5,7 @@
     #include "gem/i915_gem_ioctls.h"
   #include "gem/i915_gem_region.h"
+#include "pxp/intel_pxp.h"
     #include "i915_drv.h"
   #include "i915_user_extensions.h"
@@ -13,7 +14,8 @@ static int
   i915_gem_create(struct drm_file *file,
   struct intel_memory_region *mr,
   u64 *size_p,
-    u32 *handle_p)
+    u32 *handle_p,
+    u64 user_flags)
   {
   struct drm_i915_gem_object *obj;
   u32 handle;
@@ -35,12 +37,17 @@ i915_gem_create(struct drm_file *file,
     GEM_BUG_ON(size != obj->base.size);
   +    obj->user_flags = user_flags;
+
   ret = drm_gem_handle_create(file, >base, );
   /* drop reference from allocate - handle holds it now */
   i915_gem_object_put(obj);
   if (ret)
   return ret;
   +    if (user_flags & I915_GEM_OBJECT_PROTECTED)
+    intel_pxp_object_add(obj);
+
   *handle_p = handle;
   *size_p = size;
   return 0;
@@ -89,11 +96,12 @@ i915_gem_dumb_create(struct drm_file *file,
   return i915_gem_create(file,
intel_memory_region_by_type(to_i915(dev),
  mem_type),
-   >size, >handle);
+   >size, >handle, 0);
   }
     struct create_ext {
   struct drm_i915_private *i915;
+    unsigned long user_flags;
   };
     static int __create_setparam(struct
drm_i915_gem_object_param *args,
@@ -104,6 +112,19 @@ static int
__create_setparam(struct
drm_i915_gem_object_param *args,
   return -EINVAL;
   }
   +    switch (lower_32_bits(args->param)) {
+    case I915_OBJECT_PARAM_PROTECTED_CONTENT:
+    if (!intel_pxp_is_enabled(_data->i915->gt.pxp))
+    return -ENODEV;
+    if (args->size) {
+    return -EINVAL;
+    } else if (args->data) {
+    ext_data->user_flags |= I915_GEM_OBJECT_PROTECTED;
+    return 0;
+    }
+    break;
+    }
+
   return -EINVAL;
   }
   @@ -148,5 +169,5 @@
i915_gem_create_ioctl(struct drm_device *dev,
void *data,
   return i915_gem_create(file,
  intel_memory_region_by_type(i915,
  INTEL_MEMORY_SYSTEM),
-   >size, >handle);
+   >size, >handle,
ext_data.user_flags);
   }
diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index e503c9f789c0..d10c4fcb6aec 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -20,6 +20,7 @@
   #include "gt/intel_gt_buffer_pool.h"
   #include "gt/intel_gt_pm.h"
   #include "gt/intel_ring.h"
+#include "pxp/intel_pxp.h"
     #include "pxp/intel_pxp.h"
   @@ -500,6 +501,15 @@ eb_va

Re: [Intel-gfx] [PATCH] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-03-04 Thread Lionel Landwerlin


On 03/03/2021 23:28, Umesh Nerlige Ramappa wrote:

Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lack enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- Kernel perf subsytem allows users to specify the clock id to be used
   in perf_event_open. This clock id is used by the perf subsystem to
   return the appropriate cpu timestamp in perf events. Similarly, let
   the user pass the clockid to this query so that cpu timestamp
   corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
   register read

Signed-off-by: Umesh Nerlige Ramappa
---
  drivers/gpu/drm/i915/i915_query.c | 144 ++


FYI, the MR for Mesa : 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9407



-Lionel

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2230 matches

Mail list logo