Re: [PATCH v2] drm/xe/guc: Add GuC based register capture for error capture

2024-01-16 Thread Dong, Zhanjun



Please ignore this patch.
The patch is for Xe upstream and was sent to the wrong ML.

Regards,
Zhanjun
On 2024-01-16 12:12 p.m., Zhanjun Dong wrote:

Port GuC based register capture for error capture from i915 to Xe.

There are 3 parts in this commit:
. Prepare for capture registers
 A BO is created at GuC ADS init time, which is very early, before the
 engine map is ready; that makes it hard to calculate the capture
 buffer size, so a new function is created for the worst-case size
 calculation (see the sketch after this list). Other than that, this
 part basically follows the i915 design.
. Process capture notification message
 Basically follows the i915 design.
. Sysfs command processing
 Xe switched to devcoredump and adopted command-line processing with
 the captured list.
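To make the worst-case sizing idea concrete, here is a minimal standalone C sketch; every constant and name below is an illustrative assumption, not the actual Xe driver API:

#include <stddef.h>

/* All values are assumptions for illustration only */
#define MAX_ENGINE_CLASSES    5   /* render, copy, video, video-enhance, compute */
#define MAX_ENGINE_INSTANCES  8   /* assumed worst case per class */
#define MAX_REGS_PER_LIST     96  /* assumed largest register list */
#define REG_ENTRY_BYTES       16  /* assumed bytes per captured register entry */

/*
 * Runs at GuC ADS init time, before the engine map exists, so it must
 * assume every class/instance slot is present.
 */
static size_t guc_capture_worst_case_size(void)
{
	size_t per_engine = (size_t)MAX_REGS_PER_LIST * REG_ENTRY_BYTES;

	return (size_t)MAX_ENGINE_CLASSES * MAX_ENGINE_INSTANCES * per_engine;
}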

Signed-off-by: Zhanjun Dong 

Zhanjun Dong (9):
   drm/xe/guc: Add register defines for GuC based register capture
   drm/xe/guc: Expose dss per group for GuC error capture
   drm/xe/guc: Update GuC ADS size for error capture
   drm/xe/guc: Add XE_LP steered register lists
   drm/xe/guc: Add capture size check in GuC log buffer
   drm/xe/guc: Check sizing of guc_capture output
   drm/xe/guc: Extract GuC error capture lists on G2H notification
   drm/xe/guc: Pre-allocate output nodes for extraction
   drm/xe/guc: Plumb GuC-capture into dev coredump

  drivers/gpu/drm/xe/Kconfig               |   11 +
  drivers/gpu/drm/xe/Makefile              |    1 +
  drivers/gpu/drm/xe/abi/guc_actions_abi.h |    7 +
  drivers/gpu/drm/xe/regs/xe_engine_regs.h |   12 +
  drivers/gpu/drm/xe/regs/xe_gt_regs.h     |   20 +
  drivers/gpu/drm/xe/xe_gt_mcr.c           |    4 +-
  drivers/gpu/drm/xe/xe_gt_mcr.h           |    1 +
  drivers/gpu/drm/xe/xe_gt_printk.h        |    3 +
  drivers/gpu/drm/xe/xe_gt_topology.c      |    3 -
  drivers/gpu/drm/xe/xe_guc.c              |    7 +
  drivers/gpu/drm/xe/xe_guc_ads.c          |  255 +++-
  drivers/gpu/drm/xe/xe_guc_ads_types.h    |    2 +
  drivers/gpu/drm/xe/xe_guc_capture.c      | 1404 ++
  drivers/gpu/drm/xe/xe_guc_capture.h      |   35 +
  drivers/gpu/drm/xe/xe_guc_capture_fwif.h |  225 
  drivers/gpu/drm/xe/xe_guc_ct.c           |    2 +
  drivers/gpu/drm/xe/xe_guc_fwif.h         |   68 ++
  drivers/gpu/drm/xe/xe_guc_log.c          |  179 +++
  drivers/gpu/drm/xe/xe_guc_log.h          |   15 +
  drivers/gpu/drm/xe/xe_guc_log_types.h    |   26 +
  drivers/gpu/drm/xe/xe_guc_submit.c       |   22 +-
  drivers/gpu/drm/xe/xe_guc_submit.h       |    3 +
  drivers/gpu/drm/xe/xe_guc_types.h        |    2 +
  drivers/gpu/drm/xe/xe_hw_engine.c        |   73 +-
  drivers/gpu/drm/xe/xe_hw_engine_types.h  |  106 +-
  25 files changed, 2355 insertions(+), 131 deletions(-)
  create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.c
  create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.h
  create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_fwif.h



Re: [Intel-gfx] [PATCH] drm/i915: Skip pxp init if gt is wedged

2023-11-01 Thread Dong, Zhanjun

On 2023-10-31 5:38 p.m., Teres Alexis, Alan Previn wrote:

On Fri, 2023-10-27 at 10:13 +0300, Jani Nikula wrote:

On Thu, 26 Oct 2023, Zhanjun Dong  wrote:


alan:snip

I'll note that nobody checks intel_pxp_init() return status, so this
silently skips PXP.

BR,
Jani.


alan:snip

+   if (intel_gt_is_wedged(gt))
+   return -ENODEV;
+


alan: wondering if we can add a drm_dbg in the caller of intel_pxp_init and
use a unique return value for the case of gt_is_wedged (for example: -ENXIO).
As we know, gt being wedged at startup basically means all gt usage is dead
and therefore we can't enable PXP (along with everything else that needs
submission/guc/etc). With a drm-debug in the caller that prints that return
value, it helps us differentiate between gt-is-wedged vs the platform config
not supporting PXP. However, this would mean new drm-debug 'noise' for
platforms that i915 just doesn't support PXP on at all, which would be okay
if we don't use drm_warn or drm_err and use a 'softer' message like
"PXP skipped with %d".

Please treat the above comment as a "nit" - i.e. the existing patch is good
enough for me...
(after addressing Jani's request for more commit message info). ...alan
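As a rough sketch of the nit above (hypothetical code, not part of the posted patch; the call site and exact signature are assumptions), the caller could look like:

/*
 * Hypothetical caller-side handling: a 'soft' drm_dbg instead of
 * drm_warn/drm_err, printing the return value so gt-is-wedged (a unique
 * code such as -ENXIO) can be told apart from a platform that simply
 * has no PXP support.
 */
static void pxp_init_with_dbg(struct drm_i915_private *i915)
{
	int ret = intel_pxp_init(i915);

	if (ret)
		drm_dbg(&i915->drm, "PXP skipped with %d\n", ret);
}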




Thanks Alan.
I agree, adding more drm-debug output looks like adding noise in the
gt_is_wedged case; the existing code already outputs useful info.


If the logs already let us know gt_wedged happened, and we do not expect
pxp init to run while the gt is wedged, then silently skipping pxp_init
matches the expectation.


I will re-post with an updated commit message later.

Regards,
Zhanjun


Re: [Intel-gfx] [PATCH 3/3] drm/i915/mtl: Add counters for engine busyness ticks

2023-10-19 Thread Dong, Zhanjun

See comments inline below.

Zhanjun

On 2023-09-22 6:25 p.m., john.c.harri...@intel.com wrote:

From: Umesh Nerlige Ramappa 

In the new version of GuC engine busyness, the GuC provides engine busyness
ticks as a 64-bit counter. Add a new counter to relay this value to the
user as is.

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: John Harrison 
---
  drivers/gpu/drm/i915/gt/intel_engine.h|  1 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 16 +
  drivers/gpu/drm/i915/gt/intel_engine_types.h  | 12 
  drivers/gpu/drm/i915/gt/intel_engine_user.c   |  1 +
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 67 ++-
  drivers/gpu/drm/i915/i915_pmu.c   | 25 ++-
  drivers/gpu/drm/i915/i915_pmu.h   |  2 +-
  include/uapi/drm/i915_drm.h   | 13 +++-
  8 files changed, 116 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
b/drivers/gpu/drm/i915/gt/intel_engine.h
index b58c30ac8ef02..57af7ec8ecd82 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -249,6 +249,7 @@ void intel_engine_dump_active_requests(struct list_head 
*requests,
  
  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine,

   ktime_t *now);
+u64 intel_engine_get_busy_ticks(struct intel_engine_cs *engine);
  
  void intel_engine_get_hung_entity(struct intel_engine_cs *engine,

  struct intel_context **ce, struct 
i915_request **rq);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 84a75c95f3f7d..1c9ffb1ae9889 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2426,6 +2426,22 @@ ktime_t intel_engine_get_busy_time(struct 
intel_engine_cs *engine, ktime_t *now)
return engine->busyness(engine, now);
  }
  
+/**

+ * intel_engine_get_busy_ticks() - Return current accumulated engine busyness
+ * ticks
+ * @engine: engine to report on
+ *
+ * Returns accumulated ticks @engine was busy since engine stats were enabled.
+ */
+u64 intel_engine_get_busy_ticks(struct intel_engine_cs *engine)
+{
+   if (!engine->busyness_ticks ||
+   !(engine->flags & I915_ENGINE_SUPPORTS_TICKS_STATS))
+   return 0;
+
+   return engine->busyness_ticks(engine);
+}
+
  struct intel_context *
  intel_engine_create_virtual(struct intel_engine_cs **siblings,
unsigned int count, unsigned long flags)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 40fd8f984d64b..a88d40c74d604 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -548,6 +548,11 @@ struct intel_engine_cs {
ktime_t (*busyness)(struct intel_engine_cs *engine,
ktime_t *now);
  
+	/*

+* Get engine busyness ticks
+*/
+   u64 (*busyness_ticks)(struct intel_engine_cs *engine);
+
struct intel_engine_execlists execlists;
  
  	/*

@@ -574,6 +579,7 @@ struct intel_engine_cs {
  #define I915_ENGINE_HAS_EU_PRIORITYBIT(10)
  #define I915_ENGINE_FIRST_RENDER_COMPUTE BIT(11)
  #define I915_ENGINE_USES_WA_HOLD_CCS_SWITCHOUT BIT(12)
+#define I915_ENGINE_SUPPORTS_TICKS_STATS   BIT(13)
unsigned int flags;
  
  	/*

@@ -649,6 +655,12 @@ intel_engine_supports_stats(const struct intel_engine_cs 
*engine)
return engine->flags & I915_ENGINE_SUPPORTS_STATS;
  }
  
+static inline bool

+intel_engine_supports_tick_stats(const struct intel_engine_cs *engine)
+{
+   return engine->flags & I915_ENGINE_SUPPORTS_TICKS_STATS;
+}
+
  static inline bool
  intel_engine_has_preemption(const struct intel_engine_cs *engine)
  {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index dcedff41a825f..69eb610b5ab0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -100,6 +100,7 @@ static void set_scheduler_caps(struct drm_i915_private 
*i915)
MAP(HAS_PREEMPTION, PREEMPTION),
MAP(HAS_SEMAPHORES, SEMAPHORES),
MAP(SUPPORTS_STATS, ENGINE_BUSY_STATS),
+   MAP(SUPPORTS_TICKS_STATS, ENGINE_BUSY_TICKS_STATS),
  #undef MAP
};
struct intel_engine_cs *engine;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0c1fee5360777..71749fb9ad35b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1289,12 +1289,7 @@ static void busy_v1_guc_update_pm_timestamp(struct 
intel_guc *guc, ktime_t *now)
guc->busy.v1.gt_stamp = ((u64)gt_stamp_hi << 32) | gt_stamp_lo;
  }
  
-/*

- * Unlike the execlist mode of 

Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-22 Thread Dong, Zhanjun


> -Original Message-
> From: Daniel Vetter 
> Sent: August 22, 2023 9:51 AM
> To: Dong, Zhanjun 
> Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
> Harrison,
> John C ; Andi Shyti ;
> Daniel Vetter 
> Subject: Re: [PATCH v5] drm/i915: Avoid circular locking dependency when
> flush delayed work on gt reset
> 
> On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> > This attempts to avoid circular locking dependency between flush delayed
> > work and intel_gt_reset.
> > When intel_gt_reset is called, the task will hold a lock.
> > To cancel the delayed work here, the _sync version will also acquire a
> > lock, which might trigger the possible circular locking dependency warning.
> > When intel_gt_reset is called, the reset_in_progress flag will be set; add
> > code to check the flag and call the async version if a reset is in progress.
> >
> > Signed-off-by: Zhanjun Dong 
> > Cc: John Harrison 
> > Cc: Andi Shyti 
> > Cc: Daniel Vetter 
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index a0e3ef1c65d2..600388c849f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct
> intel_guc *guc)
> >
> >  static void guc_cancel_busyness_worker(struct intel_guc *guc)
> >  {
> > -   cancel_delayed_work_sync(&guc->timestamp.work);
> > +   /*
> > +    * When intel_gt_reset is called, the task will hold a lock.
> > +    * To cancel the delayed work here, the _sync version will also
> > +    * acquire a lock, which might trigger the possible circular
> > +    * locking dependency warning.
> 
> This is not even close to a locking bugfix. Consider this a formal nack,
> because the issue here is not even close to "needs more comments to
> explain what's going on".
> -Daniel

The purpose of the comment here is to explain the locking issue condition.
> 
> > +    * Check the reset_in_progress flag and call the async version if a
> > +    * reset is in progress.


The comment here explains the check of the flag to avoid the locking
condition. The reset process is not expected to complete in a short time;
other than that, did we miss anything?

> > +*/
> > +   if (guc_to_gt(guc)->uc.reset_in_progress)
> > +   cancel_delayed_work(&guc->timestamp.work);
> > +   else
> > +   cancel_delayed_work_sync(&guc->timestamp.work);
> >  }
> >
> >  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> > --
> > 2.34.1
> >
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH v4] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-08 Thread Dong, Zhanjun

Hi Daniel,


On 2023-08-03 9:03 a.m., Daniel Vetter wrote:

On Thu, 27 Jul 2023 at 22:13, Zhanjun Dong  wrote:


This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync
version, for the reset path; it is safe as the worker has trylock code to
handle the lock. Meanwhile, keep the sync version for park/fini to ensure
the worker is not still running during suspend or shutdown.
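Since the safety argument rests on the worker's trylock, here is a simplified sketch of that worker-side pattern (condensed from the i915 ping worker; the body is abbreviated, so treat names and flow as approximate rather than the exact source):

static void guc_timestamp_ping(struct work_struct *wrk)
{
	struct intel_guc *guc = container_of(wrk, typeof(*guc),
					     timestamp.work.work);
	struct intel_gt *gt = guc_to_gt(guc);
	int srcu, ret;

	/*
	 * Take the reset lock with a trylock: if a reset is in flight the
	 * worker simply bails out, so a worker that the non-sync cancel
	 * missed cannot deadlock against intel_gt_reset().
	 */
	ret = intel_gt_reset_trylock(gt, &srcu);
	if (ret)
		return;

	/* ... sample GT/GuC timestamps and re-arm the delayed work ... */

	intel_gt_reset_unlock(gt, srcu);
}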

WARNING: possible circular locking dependency detected
6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted
--
kms_pipe_crc_ba/6415 is trying to acquire lock:
88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530

but task is already holding lock:
88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
 lock_acquire+0xd8/0x2d0
 i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
 intel_gt_init_reset+0x65/0x80 [i915]
 intel_gt_common_init_early+0xe1/0x170 [i915]
 intel_root_gt_init_early+0x48/0x60 [i915]
 i915_driver_probe+0x671/0xcb0 [i915]
 i915_pci_probe+0xdc/0x210 [i915]
 pci_device_probe+0x95/0x120
 really_probe+0x164/0x3c0
 __driver_probe_device+0x73/0x160
 driver_probe_device+0x19/0xa0
 __driver_attach+0xb6/0x180
 bus_for_each_dev+0x77/0xd0
 bus_add_driver+0x114/0x210
 driver_register+0x5b/0x110
 __pfx_vgem_open+0x3/0x10 [vgem]
 do_one_initcall+0x57/0x270
 do_init_module+0x5f/0x220
 load_module+0x1ca4/0x1f00
 __do_sys_finit_module+0xb4/0x130
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
 lock_acquire+0xd8/0x2d0
 fs_reclaim_acquire+0xac/0xe0
 kmem_cache_alloc+0x32/0x260
 i915_vma_instance+0xb2/0xc60 [i915]
 i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
 vm_fault_gtt+0x22d/0xf60 [i915]
 __do_fault+0x2f/0x1d0
 do_pte_missing+0x4a/0xd20
 __handle_mm_fault+0x5b0/0x790
 handle_mm_fault+0xa2/0x230
 do_user_addr_fault+0x3ea/0xa10
 exc_page_fault+0x68/0x1a0
 asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){}-{0:0}:
 lock_acquire+0xd8/0x2d0
 _intel_gt_reset_lock+0x57/0x330 [i915]
 guc_timestamp_ping+0x35/0x130 [i915]
 process_one_work+0x250/0x510
 worker_thread+0x4f/0x3a0
 kthread+0xff/0x130
 ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
 check_prev_add+0x90/0xc60
 __lock_acquire+0x1998/0x2590
 lock_acquire+0xd8/0x2d0
 __flush_work+0x74/0x530
 __cancel_work_timer+0x14f/0x1f0
 intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
 intel_uc_reset_prepare+0x9c/0x120 [i915]
 reset_prepare+0x21/0x60 [i915]
 intel_gt_reset+0x1dd/0x470 [i915]
 intel_gt_reset_global+0xfb/0x170 [i915]
 intel_gt_handle_error+0x368/0x420 [i915]
 intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
 i915_wedged_set+0x29/0x40 [i915]
 simple_attr_write_xsigned.constprop.0+0xb4/0x110
 full_proxy_write+0x52/0x80
 vfs_write+0xc5/0x4f0
 ksys_write+0x64/0xe0
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:
  Chain exists of:
   (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&gt->reset.mutex);
                                 lock(fs_reclaim);
                                 lock(&gt->reset.mutex);
    lock((work_completion)(&(&guc->timestamp.work)->work));

  *** DEADLOCK ***
  3 locks held by kms_pipe_crc_ba/6415:
   #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
   #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
   #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls 
asynchronous cancel.
v3: Add sync flag to intel_guc_submission_disable to ensure reset path calls 
asynchronous cancel.
v4: Set to always sync from __uc_fini_hw path.

Signed-off-by: Zhanjun Dong 
Cc: John Harrison 
Cc: Andi Shyti 
---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 ++---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.h   |  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_uc.c   |  4 ++--
  3 files changed, 13 insertions(+), 10 

Re: [Intel-gfx] [PATCH v4] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-08-08 Thread Dong, Zhanjun

Hi Andi,


On 2023-08-03 8:36 a.m., Andi Shyti wrote:

Hi Zhanjun,

On Thu, Jul 27, 2023 at 01:13:23PM -0700, Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync
version, for the reset path; it is safe as the worker has trylock code to
handle the lock. Meanwhile, keep the sync version for park/fini to ensure
the worker is not still running during suspend or shutdown.


Next time, please wrap the sentences to 65 characters (standing
to the e-mail netiquette, RFC1855[1]) or 70-75 characters
(standing to the kernel guidelines[2]).

[1] https://www.ietf.org/rfc/rfc1855.txt
 chapter "2.1.1 For mail", page 3
[2] https://docs.kernel.org/process/submitting-patches.html
 chapter "The canonical patch format"



Thanks, this will be fixed in the next revision.


WARNING: possible circular locking dependency detected
6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted
--
kms_pipe_crc_ba/6415 is trying to acquire lock:
88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530

but task is already holding lock:
88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
 lock_acquire+0xd8/0x2d0
 i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
 intel_gt_init_reset+0x65/0x80 [i915]
 intel_gt_common_init_early+0xe1/0x170 [i915]
 intel_root_gt_init_early+0x48/0x60 [i915]
 i915_driver_probe+0x671/0xcb0 [i915]
 i915_pci_probe+0xdc/0x210 [i915]
 pci_device_probe+0x95/0x120
 really_probe+0x164/0x3c0
 __driver_probe_device+0x73/0x160
 driver_probe_device+0x19/0xa0
 __driver_attach+0xb6/0x180
 bus_for_each_dev+0x77/0xd0
 bus_add_driver+0x114/0x210
 driver_register+0x5b/0x110
 __pfx_vgem_open+0x3/0x10 [vgem]
 do_one_initcall+0x57/0x270
 do_init_module+0x5f/0x220
 load_module+0x1ca4/0x1f00
 __do_sys_finit_module+0xb4/0x130
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
 lock_acquire+0xd8/0x2d0
 fs_reclaim_acquire+0xac/0xe0
 kmem_cache_alloc+0x32/0x260
 i915_vma_instance+0xb2/0xc60 [i915]
 i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
 vm_fault_gtt+0x22d/0xf60 [i915]
 __do_fault+0x2f/0x1d0
 do_pte_missing+0x4a/0xd20
 __handle_mm_fault+0x5b0/0x790
 handle_mm_fault+0xa2/0x230
 do_user_addr_fault+0x3ea/0xa10
 exc_page_fault+0x68/0x1a0
 asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){}-{0:0}:
 lock_acquire+0xd8/0x2d0
 _intel_gt_reset_lock+0x57/0x330 [i915]
 guc_timestamp_ping+0x35/0x130 [i915]
 process_one_work+0x250/0x510
 worker_thread+0x4f/0x3a0
 kthread+0xff/0x130
 ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
 check_prev_add+0x90/0xc60
 __lock_acquire+0x1998/0x2590
 lock_acquire+0xd8/0x2d0
 __flush_work+0x74/0x530
 __cancel_work_timer+0x14f/0x1f0
 intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
 intel_uc_reset_prepare+0x9c/0x120 [i915]
 reset_prepare+0x21/0x60 [i915]
 intel_gt_reset+0x1dd/0x470 [i915]
 intel_gt_reset_global+0xfb/0x170 [i915]
 intel_gt_handle_error+0x368/0x420 [i915]
 intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
 i915_wedged_set+0x29/0x40 [i915]
 simple_attr_write_xsigned.constprop.0+0xb4/0x110
 full_proxy_write+0x52/0x80
 vfs_write+0xc5/0x4f0
 ksys_write+0x64/0xe0
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:
  Chain exists of:
   (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&gt->reset.mutex);
                                 lock(fs_reclaim);
                                 lock(&gt->reset.mutex);
    lock((work_completion)(&(&guc->timestamp.work)->work));

  *** DEADLOCK ***
  3 locks held by kms_pipe_crc_ba/6415:
   #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
   #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
   #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls 
asynchronous cancel.
v3: Add sync flag to 

Re: [Intel-gfx] [PATCH] drm/i915/gt: Remove incorrect hard coded cache coherrency setting

2023-06-23 Thread Dong, Zhanjun

Hi Fei,

Thanks for review. I put my answers inline below.

Regards,

Zhanjun

On 2023-06-22 6:20 p.m., Yang, Fei wrote:
> The previous i915_gem_object_create_internal already set it to the proper
> value before the function returns. This hard-coded setting is incorrect for
> platforms like MTL, thus it needs to be removed.
>
> Signed-off-by: Zhanjun Dong 
> ---
>  drivers/gpu/drm/i915/gt/intel_timeline.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c 
b/drivers/gpu/drm/i915/gt/intel_timeline.c

> index b9640212d659..693d18e14b00 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -26,8 +26,6 @@ static struct i915_vma *hwsp_alloc(struct intel_gt 
*gt)

>          if (IS_ERR(obj))
>                  return ERR_CAST(obj);
>
> - i915_gem_object_set_cache_coherency(obj, I915_CACHE_LLC);
> -

Does this change really fix the coherency issue?


Testing is in progress. The issue was reported by the E2E team, and it is
now their public holiday. Meanwhile, I have trouble running the test case
on my setup; I need to sync with them later.




I consulted with Chris and he said that the hwsp is purposely set to be
cacheable. The mapping on CPU side also indicates it's cacheable,


For a single-end access area that setting works well. The problem here is
that the head/tail memory area requires a different cache setting.


As the previous i915_gem_object_create_internal already sets the cache
setting properly for the current platform, why do we overwrite it here?





intel_timeline_pin_map(struct intel_timeline *timeline)
{
	struct drm_i915_gem_object *obj = timeline->hwsp_ggtt->obj;
	u32 ofs = offset_in_page(timeline->hwsp_offset);
	void *vaddr;

	vaddr = i915_gem_object_pin_map(obj, I915_MAP_WB);
	...
}

Maybe we should also set it to match the platform as well?


>          vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);
>          if (IS_ERR(vma))
>                  i915_gem_object_put(obj);
> --
> 2.34.1


Re: [Intel-gfx] [PATCH] drm/i915/gt: Remove incorrect hard coded cache coherrency setting

2023-06-22 Thread Dong, Zhanjun

Resend to restart the CI,

https://patchwork.freedesktop.org/series/119485/

Was stuck.


Regards,

Zhanjun

On 2023-06-22 11:26 a.m., Zhanjun Dong wrote:

The previous i915_gem_object_create_internal already set it to the proper
value before the function returns. This hard-coded setting is incorrect for
platforms like MTL, thus it needs to be removed.

Signed-off-by: Zhanjun Dong 
---
  drivers/gpu/drm/i915/gt/intel_timeline.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c 
b/drivers/gpu/drm/i915/gt/intel_timeline.c
index b9640212d659..693d18e14b00 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -26,8 +26,6 @@ static struct i915_vma *hwsp_alloc(struct intel_gt *gt)
if (IS_ERR(obj))
return ERR_CAST(obj);
  
-	i915_gem_object_set_cache_coherency(obj, I915_CACHE_LLC);

-
	vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);
	if (IS_ERR(vma))
		i915_gem_object_put(obj);


Re: [Intel-gfx] [PATCH v3] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-06-15 Thread Dong, Zhanjun
V3 follows John's suggested option 1. The better option is under discussion
and might have broader impact.


Meanwhile, we can start with option 1, check the CI system reports, and see
if the issue gets better.



Regards,

Zhanjun Dong

On 2023-06-15 5:15 p.m., Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync
version, for the reset path; it is safe as the worker has trylock code to
handle the lock. Meanwhile, keep the sync version for park/fini to ensure
the worker is not still running during suspend or shutdown.

WARNING: possible circular locking dependency detected
6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted
--
kms_pipe_crc_ba/6415 is trying to acquire lock:
88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530

but task is already holding lock:
88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
 lock_acquire+0xd8/0x2d0
 i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
 intel_gt_init_reset+0x65/0x80 [i915]
 intel_gt_common_init_early+0xe1/0x170 [i915]
 intel_root_gt_init_early+0x48/0x60 [i915]
 i915_driver_probe+0x671/0xcb0 [i915]
 i915_pci_probe+0xdc/0x210 [i915]
 pci_device_probe+0x95/0x120
 really_probe+0x164/0x3c0
 __driver_probe_device+0x73/0x160
 driver_probe_device+0x19/0xa0
 __driver_attach+0xb6/0x180
 bus_for_each_dev+0x77/0xd0
 bus_add_driver+0x114/0x210
 driver_register+0x5b/0x110
 __pfx_vgem_open+0x3/0x10 [vgem]
 do_one_initcall+0x57/0x270
 do_init_module+0x5f/0x220
 load_module+0x1ca4/0x1f00
 __do_sys_finit_module+0xb4/0x130
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
 lock_acquire+0xd8/0x2d0
 fs_reclaim_acquire+0xac/0xe0
 kmem_cache_alloc+0x32/0x260
 i915_vma_instance+0xb2/0xc60 [i915]
 i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
 vm_fault_gtt+0x22d/0xf60 [i915]
 __do_fault+0x2f/0x1d0
 do_pte_missing+0x4a/0xd20
 __handle_mm_fault+0x5b0/0x790
 handle_mm_fault+0xa2/0x230
 do_user_addr_fault+0x3ea/0xa10
 exc_page_fault+0x68/0x1a0
 asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){}-{0:0}:
 lock_acquire+0xd8/0x2d0
 _intel_gt_reset_lock+0x57/0x330 [i915]
 guc_timestamp_ping+0x35/0x130 [i915]
 process_one_work+0x250/0x510
 worker_thread+0x4f/0x3a0
 kthread+0xff/0x130
 ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
 check_prev_add+0x90/0xc60
 __lock_acquire+0x1998/0x2590
 lock_acquire+0xd8/0x2d0
 __flush_work+0x74/0x530
 __cancel_work_timer+0x14f/0x1f0
 intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
 intel_uc_reset_prepare+0x9c/0x120 [i915]
 reset_prepare+0x21/0x60 [i915]
 intel_gt_reset+0x1dd/0x470 [i915]
 intel_gt_reset_global+0xfb/0x170 [i915]
 intel_gt_handle_error+0x368/0x420 [i915]
 intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
 i915_wedged_set+0x29/0x40 [i915]
 simple_attr_write_xsigned.constprop.0+0xb4/0x110
 full_proxy_write+0x52/0x80
 vfs_write+0xc5/0x4f0
 ksys_write+0x64/0xe0
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:
  Chain exists of:
   (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&gt->reset.mutex);
                                 lock(fs_reclaim);
                                 lock(&gt->reset.mutex);
    lock((work_completion)(&(&guc->timestamp.work)->work));

  *** DEADLOCK ***
  3 locks held by kms_pipe_crc_ba/6415:
   #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
   #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
   #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls 
asynchronous cancel.
v3: Add sync flag to intel_guc_submission_disable to ensure reset path calls 
asynchronous cancel.

Signed-off-by: Zhanjun Dong 
---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 ++---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.h   |  2 +-
  

Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-06-08 Thread Dong, Zhanjun
John mentioned 2 options:

  1. "add the sync/async flag to _submission_disable()"

     Considered to be a small change.

  2. add an 'are busyness stats enabled' boolean to the guc structure

     The affected area across the flow seems much larger than with option 1.

I would like to discuss this a bit more before moving forward; a rough
sketch of option 1 is below.
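A rough sketch of option 1's shape (an illustration of the idea only, not the posted v2/v3 patch):

/*
 * Reset path passes sync=false: the worker's own trylock handling makes
 * the non-blocking cancel safe there, while park/fini keep the
 * synchronous cancel for suspend/shutdown.
 */
static void guc_cancel_busyness_worker(struct intel_guc *guc, bool sync)
{
	if (sync)
		cancel_delayed_work_sync(&guc->timestamp.work);
	else
		cancel_delayed_work(&guc->timestamp.work);
}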

Regards,
Zhanjun Dong

From: Harrison, John C 
Sent: June 7, 2023 4:17 PM
To: Dong, Zhanjun ; intel-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org; Nerlige Ramappa, Umesh 
; Ceraolo Spurio, Daniele 

Subject: Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency 
when flush delayed work on gt reset

On 6/7/2023 12:03, Zhanjun Dong wrote:


This attempts to avoid circular locking dependency between flush delayed work
and intel_gt_reset.

Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync
version, for the reset path; it is safe as the worker has trylock code to
handle the lock. Meanwhile, keep the sync version for park/fini to ensure the
worker is not still running during suspend or shutdown.



WARNING: possible circular locking dependency detected

6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted

--

kms_pipe_crc_ba/6415 is trying to acquire lock:

88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530



but task is already holding lock:

88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]



which lock already depends on the new lock.



the existing dependency chain (in reverse order) is:



-> #3 (&gt->reset.mutex){+.+.}-{3:3}:

lock_acquire+0xd8/0x2d0

i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]

intel_gt_init_reset+0x65/0x80 [i915]

intel_gt_common_init_early+0xe1/0x170 [i915]

intel_root_gt_init_early+0x48/0x60 [i915]

i915_driver_probe+0x671/0xcb0 [i915]

i915_pci_probe+0xdc/0x210 [i915]

pci_device_probe+0x95/0x120

really_probe+0x164/0x3c0

__driver_probe_device+0x73/0x160

driver_probe_device+0x19/0xa0

__driver_attach+0xb6/0x180

bus_for_each_dev+0x77/0xd0

bus_add_driver+0x114/0x210

driver_register+0x5b/0x110

__pfx_vgem_open+0x3/0x10 [vgem]

do_one_initcall+0x57/0x270

do_init_module+0x5f/0x220

load_module+0x1ca4/0x1f00

__do_sys_finit_module+0xb4/0x130

do_syscall_64+0x3c/0x90

entry_SYSCALL_64_after_hwframe+0x72/0xdc



-> #2 (fs_reclaim){+.+.}-{0:0}:

lock_acquire+0xd8/0x2d0

fs_reclaim_acquire+0xac/0xe0

kmem_cache_alloc+0x32/0x260

i915_vma_instance+0xb2/0xc60 [i915]

i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]

vm_fault_gtt+0x22d/0xf60 [i915]

__do_fault+0x2f/0x1d0

do_pte_missing+0x4a/0xd20

__handle_mm_fault+0x5b0/0x790

handle_mm_fault+0xa2/0x230

do_user_addr_fault+0x3ea/0xa10

exc_page_fault+0x68/0x1a0

asm_exc_page_fault+0x26/0x30



-> #1 (&gt->reset.backoff_srcu){}-{0:0}:

lock_acquire+0xd8/0x2d0

_intel_gt_reset_lock+0x57/0x330 [i915]

guc_timestamp_ping+0x35/0x130 [i915]

process_one_work+0x250/0x510

worker_thread+0x4f/0x3a0

kthread+0xff/0x130

ret_from_fork+0x29/0x50



-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:

check_prev_add+0x90/0xc60

__lock_acquire+0x1998/0x2590

lock_acquire+0xd8/0x2d0

__flush_work+0x74/0x530

__cancel_work_timer+0x14f/0x1f0

intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]

intel_uc_reset_prepare+0x9c/0x120 [i915]

reset_prepare+0x21/0x60 [i915]

intel_gt_reset+0x1dd/0x470 [i915]

intel_gt_reset_global+0xfb/0x170 [i915]

intel_gt_handle_error+0x368/0x420 [i915]

intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]

i915_wedged_set+0x29/0x40 [i915]

simple_attr_write_xsigned.constprop.0+0xb4/0x110

full_proxy_write+0x52/0x80

vfs_write+0xc5/0x4f0

ksys_write+0x64/0xe0

do_syscall_64+0x3c/0x90

entry_SYSCALL_64_after_hwframe+0x72/0xdc



other info that might help us debug this:

 Chain exists of:

  (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1



   lock(&gt->reset.mutex);

                           lock(fs_reclaim);

                           lock(&gt->reset.mutex);

   lock((work_completion)(&(&guc->timestamp.work)->work));



 *** DEADLOCK ***

 3 locks held by kms_pipe_crc_ba/6415:

  #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0

  #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: 
simple_attr_write_x

Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

2023-06-08 Thread Dong, Zhanjun

Hi Andi,

Thanks for the comments. The info will be updated in the next revision,
which is on the way.


Regards,

Zhanjun Dong

On 2023-06-07 8:19 p.m., Andi Shyti wrote:

Hi Dong,

On Wed, Jun 07, 2023 at 12:03:50PM -0700, Zhanjun Dong wrote:

This attempts to avoid circular locking dependency between flush delayed
work and intel_gt_reset.
Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync
version, for the reset path; it is safe as the worker has trylock code to
handle the lock. Meanwhile, keep the sync version for park/fini to ensure
the worker is not still running during suspend or shutdown.

WARNING: possible circular locking dependency detected
6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted
--
kms_pipe_crc_ba/6415 is trying to acquire lock:
88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530

but task is already holding lock:
88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
 lock_acquire+0xd8/0x2d0
 i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
 intel_gt_init_reset+0x65/0x80 [i915]
 intel_gt_common_init_early+0xe1/0x170 [i915]
 intel_root_gt_init_early+0x48/0x60 [i915]
 i915_driver_probe+0x671/0xcb0 [i915]
 i915_pci_probe+0xdc/0x210 [i915]
 pci_device_probe+0x95/0x120
 really_probe+0x164/0x3c0
 __driver_probe_device+0x73/0x160
 driver_probe_device+0x19/0xa0
 __driver_attach+0xb6/0x180
 bus_for_each_dev+0x77/0xd0
 bus_add_driver+0x114/0x210
 driver_register+0x5b/0x110
 __pfx_vgem_open+0x3/0x10 [vgem]
 do_one_initcall+0x57/0x270
 do_init_module+0x5f/0x220
 load_module+0x1ca4/0x1f00
 __do_sys_finit_module+0xb4/0x130
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
 lock_acquire+0xd8/0x2d0
 fs_reclaim_acquire+0xac/0xe0
 kmem_cache_alloc+0x32/0x260
 i915_vma_instance+0xb2/0xc60 [i915]
 i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
 vm_fault_gtt+0x22d/0xf60 [i915]
 __do_fault+0x2f/0x1d0
 do_pte_missing+0x4a/0xd20
 __handle_mm_fault+0x5b0/0x790
 handle_mm_fault+0xa2/0x230
 do_user_addr_fault+0x3ea/0xa10
 exc_page_fault+0x68/0x1a0
 asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){}-{0:0}:
 lock_acquire+0xd8/0x2d0
 _intel_gt_reset_lock+0x57/0x330 [i915]
 guc_timestamp_ping+0x35/0x130 [i915]
 process_one_work+0x250/0x510
 worker_thread+0x4f/0x3a0
 kthread+0xff/0x130
 ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
 check_prev_add+0x90/0xc60
 __lock_acquire+0x1998/0x2590
 lock_acquire+0xd8/0x2d0
 __flush_work+0x74/0x530
 __cancel_work_timer+0x14f/0x1f0
 intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
 intel_uc_reset_prepare+0x9c/0x120 [i915]
 reset_prepare+0x21/0x60 [i915]
 intel_gt_reset+0x1dd/0x470 [i915]
 intel_gt_reset_global+0xfb/0x170 [i915]
 intel_gt_handle_error+0x368/0x420 [i915]
 intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
 i915_wedged_set+0x29/0x40 [i915]
 simple_attr_write_xsigned.constprop.0+0xb4/0x110
 full_proxy_write+0x52/0x80
 vfs_write+0xc5/0x4f0
 ksys_write+0x64/0xe0
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:
  Chain exists of:
   (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex
   Possible unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(&gt->reset.mutex);
                                 lock(fs_reclaim);
                                 lock(&gt->reset.mutex);
    lock((work_completion)(&(&guc->timestamp.work)->work));

  *** DEADLOCK ***
  3 locks held by kms_pipe_crc_ba/6415:
   #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
   #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
   #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

Signed-off-by: Zhanjun Dong 

Andrzej's r-b is missing here.


---

Please add a version to your patch and a changelog.

Thanks,
Andi


Re: [Intel-gfx] [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-06-08 Thread Dong, Zhanjun


> -Original Message-
> From: Teres Alexis, Alan Previn 
> Sent: June 8, 2023 2:31 PM
> To: Dong, Zhanjun ; intel-
> g...@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Subject: Re: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes
> before selftests
> 
> On Thu, 2023-06-08 at 18:14 +, Dong, Zhanjun wrote:
> > See my comments below.
> >
> > > -Original Message-
> > > From: Alan Previn 
> alan:snip
> 
> > > +static int
> > > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > > +unsigned long timeout_ms)
> > > +{
> > > + bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> > > +  i915->media_gt &&
> > > +  HAS_ENGINE(i915->media_gt, GSC0) &&
> > > +  intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > > +
> > > + /*
> > > +  * For gsc proxy component loading + init, we need a much longer timeout
> > > +  * than what the CI selftest infrastructure currently uses. This longer
> > > +  * wait period depends on the kernel config and component driver load
> > > +  * ordering
> > > +  */
> > > + if (timeout_ms < 8000)
> > > + timeout_ms = 8000;
> >
> >
> > Lgtm, just a concern about the fixed number here: shall we set the
> > minimum here, or let i915_selftest.timeout_ms take control? That way no
> > coding change would be needed here in the future.
> >
> > Reviewed-by: Zhanjun Dong 
> 
> Thanks Zhanjun, unfortunately, based on internal testing, the
> i915_selftest.timeout_ms default is so low that it does occasionally
> time out in CI. From experience, with a lean ubuntu config, it typically
> takes about ~1 second for the mei-gsc-sw-proxy component driver to load
> after i915 loads.
> Since CI regularly unloads and reloads i915, the timeout observed ends up
> being reported as an issue.
> 
> 8 seconds was based on internal testing of the worst case scenario - which
> hardly ever happens.
> We've only seen the 8 seconds happen when the kernel config has configs
> enabled for very many SOC IP drivers and component drivers (seen in at
> least one customer config) or if the MTL board IFWI was only just
> reflashed (this would be a one-off 8 seconds, we suspect due to the
> firmware doing additional steps)
> 

Thanks for the detailed info. i915_selftest.timeout_ms is too low for this
test, so it needs a special minimum of its own; a valid reason.
The concern I raised is at a minor level. Looks good to me.

Regards,
Zhanjun


Re: [Intel-gfx] [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-06-08 Thread Dong, Zhanjun
See my comments below.

> -Original Message-
> From: Alan Previn 
> Sent: May 30, 2023 1:01 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org; Alan Previn
> 
> Subject: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before
> selftests
> 
> On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> GSC engine will fail. Those init flows are dependent on the mei's
> gsc_proxy component that is loaded in parallel with i915 and a
> worker that could potentially start after i915 driver init is done.
> 
> That said, all subsystems that access the GSC engine today do check
> for such init flow completion before using the GSC engine. However,
> selftests currently don't wait on anything before starting.
> 
> To fix this, add a waiter function at the start of __run_selftests
> that waits for gsc-proxy init flows to complete. While implementing this,
> use a table of function pointers so it's scalable to add additional
> waiter functions for future such "wait on dependency" cases.
> 
> Difference from prior versions:
>v1: Based on internal testing, increase the timeout for gsc-proxy
>specific case to 8 seconds.
> 
> Signed-off-by: Alan Previn 
> ---
>  .../gpu/drm/i915/selftests/i915_selftest.c| 61 +++
>  1 file changed, 61 insertions(+)
> 
> 
> base-commit: 0ae4ee2c735979030a0219218081eee661606921
> 
> diff --git a/drivers/gpu/drm/i915/selftests/i915_selftest.c
> b/drivers/gpu/drm/i915/selftests/i915_selftest.c
> index 39da0fb0d6d2..77168a7a7e3f 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_selftest.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_selftest.c
> @@ -24,6 +24,8 @@
>  #include 
> 
>  #include "gt/intel_gt_pm.h"
> +#include "gt/uc/intel_gsc_fw.h"
> +
>  #include "i915_driver.h"
>  #include "i915_drv.h"
>  #include "i915_selftest.h"
> @@ -127,6 +129,63 @@ static void set_default_test_all(struct selftest *st,
> unsigned int count)
>   st[i].enabled = true;
>  }
> 
> +static int
> +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> +unsigned long timeout_ms)
> +{
> + bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> +  i915->media_gt &&
> +  HAS_ENGINE(i915->media_gt, GSC0) &&
> +  intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> +
> + /*
> +  * For gsc proxy component loading + init, we need a much longer timeout
> +  * than what the CI selftest infrastructure currently uses. This longer
> +  * wait period depends on the kernel config and component driver load
> +  * ordering
> +  */
> + if (timeout_ms < 8000)
> + timeout_ms = 8000;


Lgtm, just a concern about the fixed number here: shall we set the minimum
here, or let i915_selftest.timeout_ms take control? That way no coding
change would be needed here in the future.

Reviewed-by: Zhanjun Dong 
> +
> + if (need_to_wait &&
> + (wait_for(intel_gsc_uc_fw_proxy_init_done(&i915->media_gt->uc.gsc),
> + timeout_ms)))
> + return -ETIME;
> +
> + return 0;
> +}
> +
> +struct __startup_waiter {
> + const char *name;
> + int (*wait_to_completed)(struct drm_i915_private *i915, unsigned
> long timeout_ms);
> +};
> +
> +static struct __startup_waiter all_startup_waiters[] = { \
> + {"gsc_proxy", __wait_gsc_proxy_completed} \
> + };
> +
> +static int __wait_on_all_system_dependencies(struct drm_i915_private
> *i915)
> +{
> + struct __startup_waiter *waiter = all_startup_waiters;
> + int count = ARRAY_SIZE(all_startup_waiters);
> + int ret;
> +
> + if (!waiter || !count || !i915)
> + return 0;
> +
> + for (; count--; waiter++) {
> + if (!waiter->wait_to_completed)
> + continue;
> + ret = waiter->wait_to_completed(i915,
> i915_selftest.timeout_ms);
> + if (ret) {
> + pr_info(DRIVER_NAME ": Pre-selftest waiter %s failed
> with %d\n",
> + waiter->name, ret);
> + return ret;
> + }
> + }
> +
> + return 0;
> +}
> +
>  static int __run_selftests(const char *name,
>  struct selftest *st,
>  unsigned int count,
> @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
>  {
>   int err = 0;
> 
> + __wait_on_all_system_dependencies(data);
> +
>   while (!i915_selftest.random_seed)
>   i915_selftest.random_seed = get_random_u32();
> 


Re: [Intel-gfx] [PATCH] drm/i915: Set wedged if enable guc communication failed

2023-03-02 Thread Dong, Zhanjun
Thanks Jani.
Updated patch sent, let me know if you have any comments.

Regards,
Zhanjun

> -Original Message-
> From: Jani Nikula 
> Sent: February 27, 2023 6:30 AM
> To: Dong, Zhanjun ; intel-
> g...@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Cc: Dong, Zhanjun 
> Subject: Re: [PATCH] drm/i915: Set wedged if enable guc communication
> failed
> 
> On Fri, 24 Feb 2023, Zhanjun Dong  wrote:
> > Add err code check for enable_communication on resume path, set
> wedged if failed.
> 
> I can see that this is *what* the code does, but the commit message should
> answer the question *why*.
> 
> >
> > Signed-off-by: Zhanjun Dong 
> > ---
> >  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 5 -
> > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++--
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index cef3d6f5c34e..f3bb7cbbd293 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -401,8 +401,11 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
> > intel_ggtt_restore_fences(gt->ggtt);
> >
> > ret = intel_uc_runtime_resume(&gt->uc);
> > -   if (ret)
> > +   if (ret) {
> > +   /* Set wedge if uc resume failed */
> 
> This comment is just a reiteration of the C code in English, but doesn't
> provide any useful additional information.
> 
> BR,
> Jani.
> 
> > +   intel_gt_set_wedged(gt);
> > return ret;
> > +   }
> >
> > return 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > index 6648691bd645..d4f428acf20a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool
> enable_communication)
> > /* Make sure we enable communication if and only if it's disabled */
> > GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
> >
> > -   if (enable_communication)
> > -   guc_enable_communication(guc);
> > +   if (enable_communication) {
> > +   err = guc_enable_communication(guc);
> > +   if (err) {
> > +   guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
> > +   return err;
> > +   }
> > +   }
> >
> > /* If we are only resuming GuC communication but not reloading
> >  * GuC, we need to ensure the ARAT timer interrupt is enabled
> 
> --
> Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-gfx] [PATCH] drm/i915/guc: Check for ct enabled while waiting for response

2022-07-13 Thread Dong, Zhanjun



> -Original Message-
> From: Dixit, Ashutosh 
> Sent: July 12, 2022 3:48 PM
> To: Dong, Zhanjun 
> Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: Check for ct enabled while
> waiting for response
> 
> On Thu, 16 Jun 2022 21:50:55 -0700, Dixit, Ashutosh wrote:
> >
> > On Thu, 16 Jun 2022 15:01:59 -0700, Zhanjun Dong wrote:
> > >
> > > We are seeing the error message "No response for request". Some cases
> > > happened while waiting for a response when a reset/suspend action was
> > > triggered.
> > > In this case, no response is not an error; active requests will be
> > > cancelled.
> > >
> > > This patch handles this condition and changes the error message
> > > into a debug message.
> > >
> > > Signed-off-by: Zhanjun Dong 
> > > ---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 24
> > > ---
> > >  1 file changed, 17 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > index f01325cd1b62..f07a7666b1ad 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > @@ -455,6 +455,7 @@ static int ct_write(struct intel_guc_ct *ct,
> > >
> > >  /**
> > >   * wait_for_ct_request_update - Wait for CT request state update.
> > > + * @ct:  pointer to CT
> > >   * @req: pointer to pending request
> > >   * @status:  placeholder for status
> > >   *
> > > @@ -467,9 +468,10 @@ static int ct_write(struct intel_guc_ct *ct,
> > >   * * 0 response received (status is valid)
> > >   * * -ETIMEDOUT no response within hardcoded timeout
> > >   */
> > > -static int wait_for_ct_request_update(struct ct_request *req, u32
> > > *status)
> > > +static int wait_for_ct_request_update(struct intel_guc_ct *ct,
> > > +struct ct_request *req, u32 *status)
> > >  {
> > >   int err;
> > > + bool ct_enabled;
> > >
> > >   /*
> > >* Fast commands should complete in less than 10us, so sample
> > >quickly  @@ -481,12 +483,15 @@ static int
> > >wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
> > >  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
> > >  #define done \
> > > - (FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) ==
> \
> > > + (!(ct_enabled = intel_guc_ct_enabled(ct)) || \
> > > +  FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) ==
> \
> > >GUC_HXG_ORIGIN_GUC)
> > >   err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
> > >   if (err)
> > >   err = wait_for(done,
> GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
> > >  #undef done
> > > + if (!ct_enabled)
> > > + err = -ECANCELED;
> >
> > Actually here's an even simpler suggestion. We could just do:
> >
> > if (!ct_enabled)
> > CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is
> disabled\n",
> >...);
> >
> > And return 0 as before. This way we won't have to make any changes in
> > either ct_send() or intel_guc_ct_send(). So intel_guc_ct_enabled()
> > just serves to get us out of the wait early and prevent the -ETIMEDOUT
> > return (and 0 return avoids all the error messages we are trying to
> eliminate).
> 
> Actually will need to unlink the request too, so it will be something like:
> 
>   if (!ct_enabled) {
>   CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n", ...);
> 
>   spin_lock_irqsave(&ct->requests.lock, flags);
>   list_del();
>   spin_unlock_irqrestore(&ct->requests.lock, flags);
>   }

I agree; the caller function needs err to be non-zero to know the request
did not succeed, and to unlink the request.
The caller function ct_send will do the unlink.

As for the ECANCELED error code: intel_guc_ct_send returns ENODEV if ct is
disabled, so this patch will be changed to return ENODEV to match it.
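Putting the pieces together, the wait could end up roughly like this (a sketch based on the diff above plus the -ENODEV change just discussed; not the final posted code):

static int wait_for_ct_request_update(struct intel_guc_ct *ct,
				      struct ct_request *req, u32 *status)
{
	bool ct_enabled;
	int err;

	/* Poll until GuC replies, or until the CT channel is torn down */
#define done \
	(!(ct_enabled = intel_guc_ct_enabled(ct)) || \
	 FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
	 GUC_HXG_ORIGIN_GUC)
	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
	if (err)
		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
#undef done

	/*
	 * CT went down while waiting: not a real timeout. Return -ENODEV to
	 * match intel_guc_ct_send()'s behaviour for a disabled CT, so the
	 * caller (ct_send) unlinks the request and no error is logged.
	 */
	if (!ct_enabled) {
		CT_DEBUG(ct, "Request %#x cancelled as CTB is disabled\n",
			 req->fence);
		err = -ENODEV;
	}

	*status = req->status;
	return err;
}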


Re: [Intel-gfx] [PATCH] drm/i915/guc: Check ctx while waiting for response

2022-06-14 Thread Dong, Zhanjun
Thanks for all comments, I will update code and prepare for next version.

Regards,
Zhanjun

-Original Message-
From: Dixit, Ashutosh  
Sent: June 14, 2022 12:28 PM
To: Dong, Zhanjun 
Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; Teres 
Alexis, Alan Previn ; Wajdeczko, Michal 

Subject: Re: [PATCH] drm/i915/guc: Check ctx while waiting for response

On Thu, 02 Jun 2022 10:21:19 -0700, Zhanjun Dong wrote:
>

Hi Zhanjun,

> We are seeing the error message "No response for request". Some cases
> happened while waiting for a response when a reset/suspend action was
> triggered. In this case, no response is not an error; active requests will
> be cancelled.
>
> This patch handles this condition and changes the error message
> into a debug message.

IMO the patch title should be changed: which ctx are we checking while waiting 
for response? Something like "check for ct enabled while waiting for response"?

> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> - (FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> + (!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>    GUC_HXG_ORIGIN_GUC)
>   err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>   if (err)
>   err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> + if (!intel_guc_ct_enabled(ct))
> + err = -ECANCELED;

Also, I really don't like intel_guc_ct_enabled() being called in two places. Is 
there a possibility that intel_guc_ct_enabled() can return false in the first 
place (causing the wait to exit) and then return true in the second place (so 
we don't return -ECANCELED)?

Is it possible to change the status of the request to something else from
intel_guc_ct_disable() (or wherever ct->enabled is set to false) rather than
introducing intel_guc_ct_enabled() checks here? Changing the status of the
request when CT goes down would cause the waits to exit here. And then we
could check for that special request status signifying CT went down?

Thanks.
--
Ashutosh
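For illustration, a hypothetical sketch of the alternative suggested above: mark every pending request with a sentinel status when the CT channel goes down, so waiters exit without polling intel_guc_ct_enabled() at all (the sentinel value and helper name are assumptions, not posted code):

/* Hypothetical sentinel: a status value no real GuC reply would carry */
#define CT_REQUEST_STATUS_CANCELLED	0xdeadbeef

static void ct_cancel_pending_requests(struct intel_guc_ct *ct)
{
	struct ct_request *req;
	unsigned long flags;

	spin_lock_irqsave(&ct->requests.lock, flags);
	/* Waiters polling req->status see the sentinel and stop waiting */
	list_for_each_entry(req, &ct->requests.pending, link)
		WRITE_ONCE(req->status, CT_REQUEST_STATUS_CANCELLED);
	spin_unlock_irqrestore(&ct->requests.lock, flags);
}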